Filtering Unicode Characters in Java
Source: Internet  Editor: 程序博客网  Date: 2024/06/08 08:58
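When parsing XML files, the program may throw exceptions such as the following:
Quote
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x{2}) was found in the value of attribute "{1}" and element is "1f".
Quote
An invalid XML character (Unicode: 0x1d) was found in the CDATA section.
These errors are caused by invisible special characters that are illegal in XML, so the parser throws an exception when it meets them. Below 0x20, XML 1.0 allows only tab (0x09), line feed (0x0a), and carriage return (0x0d), which leaves three invalid ranges of control characters:
0x00 - 0x08
0x0b - 0x0c
0x0e - 0x1f
The fix is to filter these illegal characters out of the string before parsing: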
string.replaceAll("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "")
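The regular expression above covers only the control characters below 0x20. XML 1.0 also rejects unpaired surrogates and the code points 0xFFFE and 0xFFFF, so a stricter filter can be useful. The following is a minimal sketch, not the original author's method; the class name `XmlSanitizer` and its method names are hypothetical and chosen only for illustration.

```java
// Hypothetical helper (not part of the original post): keeps only code points
// allowed by the XML 1.0 "Char" production and drops everything else.
final class XmlSanitizer {
    private XmlSanitizer() {}

    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Iterate by code point so valid surrogate pairs (e.g. emoji) survive,
            // while a lone surrogate is seen as 0xD800-0xDFFF and is dropped.
            int cp = input.codePointAt(i);
            if (isValidXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<c n=\"j\u001F\"/>";
        String clean = stripInvalidXmlChars(dirty);
        System.out.println("before=" + dirty.length() + " after=" + clean.length());
    }
}
```

For input that only ever contains stray control characters, the one-line `replaceAll` is enough; the loop is worth the extra code mainly when the data comes from sources with unknown encoding problems.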
Test code: TestXmlInvalidChar.java
package michael.xml;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

/**
 * Shows that an invisible control character breaks XML parsing
 * and that filtering it out first makes the parse succeed.
 *
 * @author michael
 */
public class TestXmlInvalidChar {

    public static void main(String[] args) {
        // The test string should be: <r><c d="s" n="j"></c></r>
        // This is the byte array of the well-formed string
        byte[] ba1 = new byte[] { 60, 114, 62, 60, 99, 32, 100, 61, 34, 115,
                34, 32, 110, 61, 34, 106, 34, 62, 60, 47, 99, 62, 60, 47, 114,
                62 };
        System.out.println("ba1 length=" + ba1.length);
        // Use an explicit charset instead of the platform default
        String ba1str = new String(ba1, StandardCharsets.UTF_8);
        System.out.println(ba1str);
        System.out.println("ba1str length=" + ba1str.length());
        System.out.println("-----------------------------------------");

        // Same as above, plus one invisible byte 31 (0x1f) after the 'j'
        byte[] ba2 = new byte[] { 60, 114, 62, 60, 99, 32, 100, 61, 34, 115,
                34, 32, 110, 61, 34, 106, 31, 34, 62, 60, 47, 99, 62, 60, 47,
                114, 62 };
        System.out.println("ba2 length=" + ba2.length);
        String ba2str = new String(ba2, StandardCharsets.UTF_8);
        System.out.println(ba2str);
        System.out.println("ba2str length=" + ba2str.length());
        System.out.println("-----------------------------------------");

        try {
            DocumentBuilderFactory dbfactory = DocumentBuilderFactory
                    .newInstance();
            dbfactory.setIgnoringComments(true);
            DocumentBuilder docBuilder = dbfactory.newDocumentBuilder();
            // Strip the illegal invisible characters; without this step
            // the XML parser throws an exception
            String filter = ba2str.replaceAll(
                    "[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "");
            System.out.println("Length after filtering=" + filter.length());
            ByteArrayInputStream bais = new ByteArrayInputStream(
                    filter.getBytes(StandardCharsets.UTF_8));
            Document doc = docBuilder.parse(bais);
            Element rootEl = doc.getDocumentElement();
            System.out.println("Parsed OK after filtering, root child length="
                    + rootEl.getChildNodes().getLength());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Running the test code produces the following output:
Quote
ba1 length=26
<r><c d="s" n="j"></c></r>
ba1str length=26
-----------------------------------------
ba2 length=27
<r><c d="s" n="j"></c></r>
ba2str length=27
-----------------------------------------
Length after filtering=26
Parsed OK after filtering, root child length=1
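Comparing the two, the byte arrays and the strings differ in length (26 versus 27), yet what is printed to the console looks identical, because 0x1f is invisible. After filtering, the string length drops back to 26.
References: http://sjsky.iteye.com/blog/1055063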
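Because the offending character does not show up on the console, it can help to print control characters in escaped form while debugging. The following is a small sketch under that assumption; `ControlCharEscaper` and `showControlChars` are hypothetical names, not part of the original test code.

```java
// Hypothetical debugging helper: renders control characters as \\uXXXX escapes
// so that otherwise invisible characters become visible in log output.
final class ControlCharEscaper {
    private ControlCharEscaper() {}

    static String showControlChars(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c == 0x7F) {
                // Escape every C0 control character and DEL, including tab and newlines
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String suspicious = "<c n=\"j\u001F\"></c>";
        System.out.println(showControlChars(suspicious));
    }
}
```

Printing the suspicious string through such a helper before filtering makes it easy to see exactly which character the parser is rejecting and where it sits.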
http://www.blogjava.net/fingki/archive/2008/09/04/226969.html
-- Error seen with the Fudan library search:
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x{2}) was found in the value of attribute "{1}" and element is "1f".
String xmlCode2 = HttpClientUtil.getWebInfoByHttpClientGetMethodGBK(searchURL); // fetch the web page
xmlCode2 = xmlCode2.replaceAll("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", ""); // strip illegal control characters
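When the fetched content is large or is consumed as a stream rather than a single String, the same filtering can be applied while reading. The following is only a sketch of that idea using the standard `java.io.FilterReader`; the class name `XmlControlCharFilterReader` is hypothetical, and it removes the same three control-character ranges as the regular expression above.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical wrapper: drops 0x00-0x08, 0x0b-0x0c and 0x0e-0x1f on the fly.
class XmlControlCharFilterReader extends FilterReader {

    XmlControlCharFilterReader(Reader in) {
        super(in);
    }

    private static boolean isIllegalControl(int c) {
        return c >= 0 && c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = super.read();
        } while (isIllegalControl(c)); // -1 (end of stream) is not "illegal", so the loop ends
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the buffer in place, keeping only legal characters
            int w = off;
            for (int i = off; i < off + n; i++) {
                if (!isIllegalControl(buf[i])) {
                    buf[w++] = buf[i];
                }
            }
            if (w > off) {
                return w - off;
            }
            // Everything in this chunk was filtered out; read the next chunk
        }
    }

    public static void main(String[] args) throws IOException {
        try (Reader r = new XmlControlCharFilterReader(
                new StringReader("<c n=\"j\u001F\"></c>"))) {
            StringBuilder sb = new StringBuilder();
            int ch;
            while ((ch = r.read()) != -1) {
                sb.append((char) ch);
            }
            System.out.println(sb);
        }
    }
}
```

A reader like this could be wrapped in an `org.xml.sax.InputSource` and handed to `DocumentBuilder.parse`, which avoids building a second copy of the whole document in memory.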