A modified tinyxml as an HTML parser

Source: Internet, Editor: 程序博客网, Time: 2024/05/29 07:35

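A written test from a certain company asked candidates to parse HTML, with a hint to use tinyxml.

Before the article starts, let me complain a little: what kind of written test takes several months?~~ My time is expensive; at my regular job I never work overtime unless I am paid for it ╮(╯▽╰)╭

OK, now on to the article itself...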

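tinyxml can only parse XML, but HTML does not necessarily follow the XML rules.

This shows up in two ways:

1. HTML contains tags that do not come in pairs, for example: <meta xxx = "xafaf" >

2. A script in HTML may contain the characters "<" and ">", but tinyxml treats every "<" as the start of a tag.

We modify tinyxml to address these two points: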

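I. Modify TiXmlElement::Parse() so that it recognizes unpaired tags

After the attributes of the TiXmlElement have been read, and before its value is read, change the code as follows (around line 1103 of tinyxmlparser.cpp):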

else if ( *p == '>' )
{
    /*********************************************
     * 2012/07/24 wanminfei add for HTML analysis start
     * When parsing HTML, the following tags must be
     * handled specially:
     * 1. <meta xxxx >
     * 2. <area xxxx >
     * 3. <img xxxxx >
     * 4. <input xxx >
     * None of them has an end tag of the form </xxxx>.
     *********************************************/
    if ( StringEqual( value.c_str(), "meta", false, encoding )
      || StringEqual( value.c_str(), "area", false, encoding )
      || StringEqual( value.c_str(), "img", false, encoding )
      || StringEqual( value.c_str(), "input", false, encoding ) )
    {
        // Unpaired tag: step past '>' and return without looking for an end tag.
        ++p;
        return p;
    }
    /*********************************************
     * 2012/07/24 wanminfei add for HTML analysis end
     *********************************************/

    // Done with attributes (if there were any.)
    // Read the value -- which can include other
    // elements -- read the end tag, and return.
    ++p;
    p = ReadValue( p, data, encoding );    // Note this is an Element method, and will set the error if one happens.
    if ( !p || !*p ) {
        // We were looking for the end tag, but found nothing.
        // Fix for [ 1663758 ] Failure to report error on bad XML
        if ( document ) document->SetError( TIXML_ERROR_READING_END_TAG, p, data, encoding );
        return 0;
    }

    // We should find the end tag now
    // note that:
    // </foo > and
    // </foo>
    // are both valid end tags.
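The four names hard-coded above cover only some of the HTML elements that are written without an end tag, and the third argument (false) passed to StringEqual asks for a case-sensitive comparison, so a tag written as <META ...> would presumably still be treated as a paired tag. A minimal standalone sketch of a table-driven, case-insensitive check is shown here; the name IsVoidHtmlTag and the extra entries in the list are illustrative assumptions, not part of tinyxml or of the original patch.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not a tinyxml function): returns true when `name`
// is an HTML element that is written without an end tag.
bool IsVoidHtmlTag( const std::string& name )
{
    // Assumed list: the four tags from the patch plus a few other common ones.
    static const char* const kVoidTags[] = {
        "meta", "area", "img", "input", "br", "hr", "link", "base"
    };

    // Lower-case a copy so that <META> and <meta> are treated alike.
    std::string lower( name );
    std::transform( lower.begin(), lower.end(), lower.begin(),
                    []( unsigned char c ) { return static_cast<char>( std::tolower( c ) ); } );

    // Compare the whole name, so that "metadata" does not count as "meta".
    for ( const char* tag : kVoidTags )
    {
        if ( lower == tag )
            return true;
    }
    return false;
}
```

In the patched branch, the chain of StringEqual calls could then be replaced by a single test such as IsVoidHtmlTag( value.c_str() ), assuming value holds only the element name at that point.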

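II. Add special handling for the <script> tag to tinyxml

1. Add a member variable to TiXmlText that records whether the text is the content of a <script> element (around line 1279 of tinyxml.h)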

private:
    bool cdata;          // true if this should be input and output as a CDATA style text element
    //2012/07/24 wanminfei add for HTML analysis start
    bool script_bool;    // true if this text is the content of a <script> element
    //2012/07/24 wanminfei add for HTML analysis end

2. Initialize script_bool in the TiXmlText constructors; the code is omitted here.
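The omitted step presumably amounts to giving script_bool a defined starting value of false in every constructor, so that ordinary text nodes keep the original "<" terminator. A minimal sketch using a simplified stand-in class follows; TextNodeSketch, IsScript and SetScript are hypothetical names chosen for illustration and are not tinyxml's.

```cpp
#include <string>

// Simplified stand-in for TiXmlText, for illustration only; the real class
// derives from TiXmlNode and has more members.
class TextNodeSketch
{
public:
    explicit TextNodeSketch( const std::string& initValue )
        : value( initValue ),
          cdata( false ),
          script_bool( false )   // ordinary text unless the parser says otherwise
    {
    }

    bool IsScript() const { return script_bool; }
    void SetScript( bool isScript ) { script_bool = isScript; }

private:
    std::string value;
    bool cdata;
    bool script_bool;
};
```

If the real class is also copied anywhere, the copy routine would presumably need to carry the flag over as well; otherwise a copied script node would fall back to the "<" terminator.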

3. At the point where ReadValue creates the TiXmlText, add a check for whether the enclosing element is <script> (around line 1205 of tinyxmlparser.cpp)

while ( p && *p )
{
    if ( *p != '<' )
    {
        // Take what we have, make a text element.
        TiXmlText* textNode = new TiXmlText( "" );
        /*******************************************
         * 2012/07/24 wanminfei add for HTML analysis start
         * When parsing HTML, the <script> tag is different,
         * because of content such as:
         *   <script> a<4 </script>
         * '<' and '>' may appear inside it,
         * so it has to be handled specially.
         ******************************************/
        if ( StringEqual( value.c_str(), "script", true, encoding ) )
        {
            textNode->script_bool = true;
        }
        /*******************************************
         * 2012/07/24 wanminfei add for HTML analysis end
         ******************************************/

4. Add <script>-specific handling to TiXmlText::Parse (around line 1573 of tinyxmlparser.cpp);

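else
{
    bool ignoreWhite = true;
    //2012/07/24 wanminfei add for HTML analysis start
    // Keep the pointer const: a string literal must not be bound to a plain char*.
    const char* end = "<";
    if ( this->script_bool )
    {
        // Inside <script>, a bare '<' is ordinary text; stop only at "</".
        end = "</";
        p = ReadText( p, &value, ignoreWhite, end, true, encoding );
        // ReadText leaves p just past "</". Step back onto the '/' so that
        // the shared "p-1" below lands on the '<'. Guard against a null
        // result so that a failed read is not decremented.
        if ( p )
            --p;
    }
    else
    {
        p = ReadText( p, &value, ignoreWhite, end, false, encoding );
    }
    //2012/07/24 wanminfei add for HTML analysis end.
    if ( p && *p )
        return p-1;    // don't truncate the '<'
    return 0;
}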
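Stopping at the first "</" works for a script such as a<4, but it would presumably cut the text short if the script itself contains "</", for example inside a string literal. One stricter choice is to use the full closing tag as the terminator. The standalone sketch below scans for "</script" case-insensitively and returns a pointer to its '<'; FindScriptEnd is a hypothetical helper, not part of tinyxml, and it is only an illustration of the idea rather than a drop-in change to the patch.

```cpp
#include <cctype>
#include <cstring>

// Hypothetical helper: returns a pointer to the '<' of the first "</script"
// (compared case-insensitively) at or after p, or nullptr if there is none.
const char* FindScriptEnd( const char* p )
{
    static const char kEnd[] = "</script";
    const std::size_t len = std::strlen( kEnd );

    for ( ; p && *p; ++p )
    {
        std::size_t i = 0;
        // Compare character by character; the loop stops at the terminating
        // NUL of p because '\0' never matches a character of kEnd.
        while ( i < len &&
                std::tolower( static_cast<unsigned char>( p[i] ) ) == kEnd[i] )
        {
            ++i;
        }
        if ( i == len )
            return p;
    }
    return nullptr;
}
```

Everything between the starting pointer and the returned pointer can then be taken as the script text, and the caller continues parsing from the returned '<' as it does for ordinary text.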

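With these changes, the modified tinyxml library can extract most of the information from an HTML document...

And once more, let me complain: is this really what a written test is supposed to look like? ~%>_<%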