How to Use TFHpple

Source: Internet  Published by: php代码批量替换工具  Editor: 程序博客网  Time: 2024/06/06 17:29

TFHpple is a third-party Objective-C library for parsing HTML. GitHub: https://github.com/topfunky/hpple/tree/master

Before using it, you need to:

1. Add libxml2.2.dylib to the project.

2. Set Header Search Paths: (1) for Debug, /usr/include/libxml2; (2) for Release, ${SDKROOT}/usr/include/libxml2

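#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Load the HTML to parse.
NSData  * data      = [NSData dataWithContentsOfFile:@"index.html"];

// Build the parser and run an XPath query.
// If your copy of TFHpple.h declares searchWithXPathQuery: rather than search:, call that instead.
TFHpple * doc       = [[TFHpple alloc] initWithHTMLData:data];
NSArray * elements  = [doc search:@"//a[@class='sponsor']"];

// objectAtIndex:0 throws if nothing matched, so check elements.count first in real code.
TFHppleElement * element = [elements objectAtIndex:0];
[element text];                       // The text inside the HTML element (the content of the first text node)
[element tagName];                    // "a"
[element attributes];                 // NSDictionary of href, class, id, etc.
[element objectForKey:@"href"];       // Easy access to single attribute
[element firstChildWithTagName:@"b"]; // The first "b" child node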

A parsing example in Swift

Example URL: http://computer.iscnu.net/Welcome/Login. View the page source in Chrome.

The key point is how to write xpathQueryString. My code uses let xpathQueryString = "//p[@class='cl']/label/a[@href='/Welcome/Register.html']". Each part of it is explained below.

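//p means the node to find is named p, at any depth in the document.

[@class='cl'] is optional. When present, it restricts the match to p nodes whose "class" attribute is "cl".

/label selects the child nodes named label under that p node.

/a[@href='/Welcome/Register.html'] selects, under label, the child nodes named a whose "href" attribute is "/Welcome/Register.html".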
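
To trace how the query above narrows the match step by step, here is a minimal Swift sketch. It does not use TFHpple: Node, Step, descendants(of:matching:) and children(of:matching:) are hypothetical names made up for this illustration, and the small tree is typed by hand to resemble the login page, so treat it as a model of the XPath steps rather than of the library.

```swift
import Foundation

// Toy element tree for tracing the query. This is NOT TFHpple's data model.
struct Node {
    let name: String
    var attributes: [String: String] = [:]
    var children: [Node] = []
}

// One location step: a node name plus optional [@attribute='value'] tests.
struct Step {
    let name: String
    var required: [String: String] = [:]

    func matches(_ node: Node) -> Bool {
        return node.name == name
            && required.allSatisfy { node.attributes[$0.key] == $0.value }
    }
}

// "//name": matching nodes at any depth below `node`, in document order.
func descendants(of node: Node, matching step: Step) -> [Node] {
    var found: [Node] = []
    for child in node.children {
        if step.matches(child) { found.append(child) }
        found += descendants(of: child, matching: step)
    }
    return found
}

// "/name": matching direct children of each node in `nodes`.
func children(of nodes: [Node], matching step: Step) -> [Node] {
    return nodes.flatMap { parent in parent.children.filter { step.matches($0) } }
}

// Hand-written stand-in for the page (an assumption, not the real markup).
let anchor = Node(name: "a", attributes: ["href": "/Welcome/Register.html"])
let label = Node(name: "label", children: [anchor])
let input = Node(name: "input", attributes: ["class": "username", "name": "account"])
let paragraph = Node(name: "p", attributes: ["class": "cl"], children: [input, label])
let plainParagraph = Node(name: "p", children: [Node(name: "label")])
let form = Node(name: "form", children: [paragraph, plainParagraph])
let document = Node(name: "#document", children: [form])

// //p[@class='cl']
let paragraphs = descendants(of: document,
                             matching: Step(name: "p", required: ["class": "cl"]))
// /label
let labels = children(of: paragraphs, matching: Step(name: "label"))
// /a[@href='/Welcome/Register.html']
let anchors = children(of: labels,
                       matching: Step(name: "a", required: ["href": "/Welcome/Register.html"]))

print(paragraphs.count, labels.count, anchors.count)
print(anchors.first?.attributes["href"] ?? "no match")
```

Dropping the ["class": "cl"] test from the first step makes it match both p nodes in this toy tree, which is why the predicate is optional but useful.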

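After println(firstMatching) you can see:

As shown, a node is made up of three parts: nodeAttributeArray, nodeChildArray and nodeName. Each entry in nodeAttributeArray consists of an attributeName and a nodeContent. Each entry in nodeChildArray consists of a nodeContent and a nodeName, and a nodeChildArray can itself contain further nodeChildArray entries. The final nodeName is the name of the node you searched for. Note that several nodeName entries may appear in one dump, so take care to tell which node each one belongs to.

Change xpathQueryString to "//p[@class='cl']" and call println(firstMatching) again to see a slightly more complex result:

The corresponding HTML:

Analysis (laid out this way, it should be clearer):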

nodeAttributeArray = (
    { attributeName = class; nodeContent = cl; }
);
nodeChildArray = (
    { nodeContent = "\n\t\t\t\t"; nodeName = text; },
    {
        nodeAttributeArray = (
            { attributeName = class; nodeContent = username; },
            { attributeName = tabindex; nodeContent = 1; },
            { attributeName = type; nodeContent = text; },
            { attributeName = name; nodeContent = account; },
            { attributeName = value; nodeContent = ""; },
            { attributeName = placeholder; nodeContent = ""; }
        );
        nodeName = input;
    },
    { nodeContent = "\n\t\t\t\t"; nodeName = text; },
    {
        nodeChildArray = (
            {
                nodeAttributeArray = (
                    { attributeName = href; nodeContent = "/Welcome/Register.html"; }
                );
                nodeChildArray = (
                    { nodeContent = "\U7528\U6237\U6ce8\U518c"; nodeName = text; }
                );
                nodeName = a;
            }
        );
        nodeName = label;
    }
);
nodeName = p;

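A few patterns stand out:

1. Wherever there is a nodeAttributeArray, there is also a nodeName.

2. Each entry in nodeAttributeArray has an attributeName and a nodeContent.

3. A node that has a nodeAttributeArray can also carry a nodeChildArray alongside it.

4. Each entry in nodeChildArray has a nodeContent and a nodeName (this nodeName is the child node's name), and it can in turn contain its own nodeAttributeArray.

5. The simplest nodeChildArray entry is a run of newlines and tabs, whose nodeName is text.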
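
The patterns above can be sketched as a small recursive walk in Swift. The dictionary keys (nodeName, nodeContent, nodeAttributeArray, nodeChildArray, attributeName) are taken from the dump. Everything else is an assumption for illustration: the Dump alias, the attribute, text and element builders, and the outline function are hypothetical helpers, and the data is rebuilt by hand (with only two of the input's attributes) instead of coming from TFHpple.

```swift
import Foundation

// Hypothetical alias for one dumped node or attribute entry.
typealias Dump = [String: Any]

// Small builders so the nested data stays readable (illustration only).
func attribute(_ name: String, _ value: String) -> Dump {
    return ["attributeName": name, "nodeContent": value]
}

func text(_ content: String) -> Dump {
    return ["nodeName": "text", "nodeContent": content]
}

func element(_ name: String, attributes: [Dump] = [], children: [Dump] = []) -> Dump {
    var node: Dump = ["nodeName": name]
    if !attributes.isEmpty { node["nodeAttributeArray"] = attributes }
    if !children.isEmpty { node["nodeChildArray"] = children }
    return node
}

// Turn a dumped node into indented outline lines, one per element or text run.
func outline(_ node: Dump, depth: Int = 0) -> [String] {
    let indent = String(repeating: "  ", count: depth)
    let name = (node["nodeName"] as? String) ?? "?"

    if name == "text" {
        let raw = (node["nodeContent"] as? String) ?? ""
        let content = raw.trimmingCharacters(in: .whitespacesAndNewlines)
        // Pattern 5: whitespace-only text nodes are just layout, so skip them.
        return content.isEmpty ? [] : [indent + "\"" + content + "\""]
    }

    // Pattern 2: every attribute entry pairs attributeName with nodeContent.
    let attributeEntries = (node["nodeAttributeArray"] as? [Dump]) ?? []
    let pairs = attributeEntries.map { entry -> String in
        let key = (entry["attributeName"] as? String) ?? "?"
        let value = (entry["nodeContent"] as? String) ?? ""
        return key + "=" + value
    }
    var lines = [indent + "<" + ([name] + pairs).joined(separator: " ") + ">"]

    // Pattern 4: children are nodes too, so recurse into them.
    for child in (node["nodeChildArray"] as? [Dump]) ?? [] {
        lines += outline(child, depth: depth + 1)
    }
    return lines
}

// The p node from the analysis above, rebuilt by hand.
let p = element("p", attributes: [attribute("class", "cl")], children: [
    text("\n\t\t\t\t"),
    element("input", attributes: [attribute("class", "username"),
                                  attribute("name", "account")]),
    text("\n\t\t\t\t"),
    element("label", children: [
        element("a", attributes: [attribute("href", "/Welcome/Register.html")],
                children: [text("用户注册")])
    ])
])

print(outline(p).joined(separator: "\n"))
```

Reading the dump this way makes it easier to see which nodeName belongs to which level, since each level of nodeChildArray becomes one extra step of indentation.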

0 0