I ran the larbin code and hit a few problems along the way; these are my notes.

Source: Internet. Editor: 程序博客网. Time: 2024/05/20 15:53

Problem 1: how larbin parses a page and extracts URLs.

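In use I found that larbin does not pick up enough URLs: links inside some tags, such as iframe and script, are never extracted. The goal below is simple: make larbin extract the URLs of iframe and script tags.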

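larbin uses a single-threaded, non-blocking model, and every connection corresponds to one object of type Connexion. Each Connexion object has a member pointer named parser, of type file, and this is what does the parsing.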

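A closer look at the file class: file is only an abstract base class. The real work is done by its two subclasses, html and robots; the former parses web pages and the latter parses robots.txt. When a Connexion is initialised, parser is assigned an object of type html or robots. Whatever is being parsed, parsing always goes through the endInput() interface (a good moment to review C++ polymorphism).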
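The dispatch through endInput() can be sketched in isolation. This is a minimal illustration, not larbin's code: File, Html, Robots and makeParser are hypothetical stand-ins, and endInput() returns a label here only so that the virtual dispatch is observable.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, simplified stand-ins for larbin's file / html / robots classes.
class File {
public:
    virtual ~File() {}
    // Called once the whole response has been read.
    virtual std::string endInput() = 0;
};

class Html : public File {
public:
    std::string endInput() override { return "parsed html"; }
};

class Robots : public File {
public:
    std::string endInput() override { return "parsed robots.txt"; }
};

// Hypothetical factory mirroring "parser is assigned when the Connexion
// is initialised": the caller only ever sees a File pointer.
std::unique_ptr<File> makeParser(bool isRobots) {
    if (isRobots) return std::unique_ptr<File>(new Robots());
    return std::unique_ptr<File>(new Html());
}
```

The caller never needs to know which subclass it holds: makeParser(true)->endInput() runs the robots version and makeParser(false)->endInput() runs the html version.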

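I will skip robots parsing and look at the html class. html's endInput first deals with 30x responses and then enters the most important part, parseHtml(), which contains all of the parsing.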

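First, the code of parseHtml(). Note that posParse points into the page content; it keeps moving forward during parsing until it reaches the end of the page. The code is in src/file.cc: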
void html::parseHtml () {
  while ((posParse=strchr(posParse, '<')) != NULL) {
    if (posParse[1] == '!') {
      if (posParse[2] == '-' && posParse[3] == '-') {
        posParse += 4;
        parseComment();
      } else {
        // nothing...
        posParse += 2;
      }
    } else {
      posParse++;
      parseTag();
    }
  }
}
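As you can see, it searches for '<' and then handles the tag: comments go to parseComment() (which simply skips them), and every other tag goes to parseTag().
The part of parseTag() that selects the tag looks like this: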
  if (thisCharIs(0, 'a')) { // a href
    param = "href";
    action = LINK;
    posParse++;
  } else if (thisCharIs(0, 'l')) { // link href
    isTag(thisCharIs(1, 'i') && thisCharIs(2, 'n') && thisCharIs(3, 'k'),
          "href", LINK, 4);
  } else if (thisCharIs(0, 'b')) { // base href
    isTag(thisCharIs(1, 'a') && thisCharIs(2, 's') && thisCharIs(3, 'e'),
          "href", BASE, 4);
  } else if (thisCharIs(0, 'f')) { // frame src
    isTag(thisCharIs(1, 'r') && thisCharIs(2, 'a')
          && thisCharIs(3, 'm') && thisCharIs(4, 'e'),
          "src", LINK, 5);
#ifdef IMAGES
  } else if (thisCharIs(0, 'i')) { // img src
    isTag(thisCharIs(1, 'm') && thisCharIs(2, 'g'), "src", LINK, 3);
#endif // IMAGES
  } else {
    return;
  }

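To explain: isTag is a macro defined by larbin that tells the following steps how to handle the tag. For example, in the first else if, isTag says that the current tag is "link", that the "href" attribute (param) should be looked up inside it, and that its value should be taken as a URL. The last argument, i, is the number of characters to skip, i.e. the length of the tag name; for frame it is 5.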

#define isTag(t, p, a, i) if (t) { \
    param = p; \
    action = a; \
    posParse += i; \
  } else { \
    posParse++; \
    return; \
  }

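I won't paste the rest. In short, parseContent() searches for the param string, takes what follows it as a URL, and hands the extracted URL to manageUrl() (which does things like URL de-duplication and putting the URL into a queue).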
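The "find param, take what follows" step can be tried outside larbin. The sketch below is not parseContent(): extractAttr is a hypothetical helper that works on the text of a single tag, and it handles only the simple quoted and unquoted forms.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: return the value of attribute `param` (lower case)
// inside the text of one tag, or "" if the attribute is absent.
std::string extractAttr(const std::string& tag, const std::string& param) {
    // Lower-cased copy so the attribute name matches case-insensitively.
    std::string lower(tag);
    for (char& c : lower)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

    std::size_t pos = 0;
    while ((pos = lower.find(param, pos)) != std::string::npos) {
        // Only accept a match that is preceded by whitespace.
        bool boundary = pos > 0 &&
            std::isspace(static_cast<unsigned char>(lower[pos - 1]));
        std::size_t i = pos + param.size();
        while (i < tag.size() && std::isspace(static_cast<unsigned char>(tag[i]))) i++;
        if (boundary && i < tag.size() && tag[i] == '=') {
            i++;
            while (i < tag.size() && std::isspace(static_cast<unsigned char>(tag[i]))) i++;
            if (i >= tag.size()) return "";
            char q = tag[i];
            if (q == '"' || q == '\'') {
                // Quoted value: runs up to the matching quote.
                std::size_t end = tag.find(q, i + 1);
                if (end == std::string::npos) return tag.substr(i + 1);
                return tag.substr(i + 1, end - i - 1);
            }
            // Unquoted value: runs until whitespace, '>' or end of text.
            std::size_t end = i;
            while (end < tag.size() && tag[end] != '>' &&
                   !std::isspace(static_cast<unsigned char>(tag[end]))) end++;
            return tag.substr(i, end - i);
        }
        pos += param.size();
    }
    return "";
}
```

In a real crawler the returned string would still have to be resolved against the base URL before being queued, which is the job manageUrl() and the url class do in larbin.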

So the original code only handles the <a>, <link>, <base>, <frame> and <img> tags. To parse script and iframe as well, add the following in the same style:
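  // Note: with IMAGES defined the chain already has an 'i' branch for img,
  // and isTag returns on a mismatch, so only the first 'i' branch is ever
  // tried; merge the img and iframe tests in that build.
  } else if (thisCharIs(0, 'i')) { // iframe src
    isTag(thisCharIs(1, 'f') && thisCharIs(2, 'r')
          && thisCharIs(3, 'a') && thisCharIs(4, 'm') && thisCharIs(5, 'e'),
          "src", LINK, 6);
  } else if (thisCharIs(0, 's')) { // script src
    isTag(thisCharIs(1, 'c') && thisCharIs(2, 'r')
          && thisCharIs(3, 'i') && thisCharIs(4, 'p') && thisCharIs(5, 't'),
          "src", LINK, 6);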

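To sum up, larbin's page parsing really is simple. If you need more, you can write your own html::parseHtml() on top of an XML or HTML parsing library; all you have to do is pass the extracted URLs to manageUrl().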
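If you do write your own tag selection, a lookup table is one way to avoid a chain of first-letter tests in which two tags compete for the same letter. This is only a sketch of that idea under my own assumptions: TagRule, kRules and paramForTag are hypothetical names, not part of larbin.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

struct TagRule {
    const char* tag;    // tag name, lower case
    const char* param;  // attribute that carries the URL
};

// The tags handled by the original chain plus iframe and script.
static const TagRule kRules[] = {
    {"a", "href"},    {"link", "href"},  {"base", "href"},
    {"frame", "src"}, {"img", "src"},    {"iframe", "src"},
    {"script", "src"},
};

// `p` points just after '<'. Returns the attribute to look for,
// or nullptr if the tag is not one we extract URLs from.
const char* paramForTag(const char* p) {
    for (const TagRule& r : kRules) {
        std::size_t n = std::strlen(r.tag);
        std::size_t i = 0;
        while (i < n &&
               std::tolower(static_cast<unsigned char>(p[i])) == r.tag[i]) i++;
        if (i != n) continue;
        // Require a delimiter after the name so "a" does not match "area".
        unsigned char next = static_cast<unsigned char>(p[n]);
        if (std::isspace(next) || next == '>' || next == '/') return r.param;
    }
    return nullptr;
}
```

With this shape, img and iframe are independent entries, so enabling one never hides the other.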

Problem 2: iframe URLs are now fetched, but script URLs still are not.

Tracing the code turned up the following cause, again in src/file.cc:
#ifdef ANYTYPE
#define checkType() return 0
#elif defined(IMAGES)
#define checkType() if (startWithIgnoreCase("image", area+14)) { \
    return 0; \
  } else { errorType (); }
#else
#define checkType() errorType()
#endif

int html::verifType () {
  if (startWithIgnoreCase("content-type: ", area)) {
    // Let's read the type of this doc
    if (!startWithIgnoreCase("text/html", area+14)) {
#ifdef SPECIFICSEARCH
      if (matchContentType(area+14)) {
        interestingSeen();
        isInteresting = true;
      } else {
        checkType();
      }
#else // SPECIFICSEARCH
      checkType();
#endif // SPECIFICSEARCH
    }
  }
  return 0;
}
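This code checks the Content-Type header of the response and only accepts text/html (plus image types when IMAGES is defined). If ANYTYPE is defined, the check no longer rejects anything. Responses to js requests usually carry something like Content-Type: application/x-javascript, so ANYTYPE has to be enabled by adding #define ANYTYPE in options.h.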
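The decision made by verifType() can be restated as a small function to see which responses survive. This is a hedged restatement, not larbin code: acceptHeaderLine and startsWithNoCase are hypothetical names, the two bool parameters stand in for the ANYTYPE and IMAGES compile-time switches, and the SPECIFICSEARCH branch is left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Case-insensitive prefix test; `prefix` must be given in lower case.
bool startsWithNoCase(const char* prefix, const char* s) {
    for (; *prefix; ++prefix, ++s) {
        if (std::tolower(static_cast<unsigned char>(*s)) != *prefix) return false;
    }
    return true;
}

// Returns true if the document would be kept after seeing this header line.
// anyType / images model the ANYTYPE / IMAGES switches (an assumption of
// this sketch; in larbin they are preprocessor symbols).
bool acceptHeaderLine(const char* line, bool anyType, bool images) {
    const char* key = "content-type: ";
    if (!startsWithNoCase(key, line)) return true;  // not a Content-Type line
    const char* value = line + std::strlen(key);
    if (startsWithNoCase("text/html", value)) return true;
    if (anyType) return true;
    if (images && startsWithNoCase("image", value)) return true;
    return false;                                   // errorType() in larbin
}
```

Under this model, a line such as Content-Type: application/x-javascript is rejected unless anyType is true, which matches the behaviour observed above.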

Problem 3: fetching some addresses that start with a digit fails with a noDNS error, for example http://95555.cmbchina.com/.

Here is what happens.

Tracing the program shows that the problem is in the following function, in site.cc:
/** Init a new dns query
 */
void NamedSite::newQuery () {
  // Update our stats
  newId();
  if (global::proxyAddr != NULL) {
    // we use a proxy, no need to get the sockaddr
    // give anything for going on
    siteSeen();
    siteDNS();
    // Get the robots.txt
    dnsOK();
  } else if (isdigit(name[0])) {
    // the name already in numbers-and-dots notation
    siteSeen();
    if (inet_aton(name, &addr)) {
      // Yes, it is in numbers-and-dots notation
      siteDNS();
      // Get the robots.txt
      dnsOK();
    } else {
      // No, it isn't : this site is a non sense
      dnsState = errorDns;
      dnsErr();
    }
  } else {
    // submit an adns query
    global::nbDnsCalls++;
    adns_query quer = NULL;
    adns_submit(global::ads, name,
                (adns_rrtype) adns_r_addr,
                (adns_queryflags) 0,
                this, &quer);
  }
}

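Here else if (isdigit(name[0])) means: if the first character of the host name is a digit, assume the name is already an IP address. If inet_aton then fails, the site is marked as a DNS error without ever being looked up. That is rather careless.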

Changing it as follows should fix it:
/** Init a new dns query
 */
void NamedSite::newQuery () {
  // Update our stats
  newId();
  if (global::proxyAddr != NULL) {
    // we use a proxy, no need to get the sockaddr
    // give anything for going on
    siteSeen();
    siteDNS();
    // Get the robots.txt
    dnsOK();
  } else if (inet_aton(name, &addr)) { // was: isdigit(name[0])
    // the name really is in numbers-and-dots notation
    siteSeen();
    siteDNS();
    // Get the robots.txt
    dnsOK();
  } else {
    // everything else, including names that merely start with a digit,
    // goes through a normal adns query
    global::nbDnsCalls++;
    adns_query quer = NULL;
    adns_submit(global::ads, name,
                (adns_rrtype) adns_r_addr,
                (adns_queryflags) 0,
                this, &quer);
  }
}
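The difference between the two tests is easy to reproduce outside larbin. In this sketch looksNumericOld and isNumericAddress are hypothetical wrapper names; the only library calls are isdigit and the POSIX inet_aton, so it needs a POSIX system to build.

```cpp
#include <arpa/inet.h>  // inet_aton (POSIX)
#include <cassert>
#include <cctype>

// The original test: looks only at the first character of the host name.
bool looksNumericOld(const char* name) {
    return std::isdigit(static_cast<unsigned char>(name[0])) != 0;
}

// The replacement test: let inet_aton decide whether the whole name
// is in numbers-and-dots notation.
bool isNumericAddress(const char* name) {
    struct in_addr addr;
    return inet_aton(name, &addr) != 0;
}
```

For 95555.cmbchina.com the old test says "numeric" while inet_aton rejects the name, which is exactly the case that used to end in a DNS error.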

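Problem 4: even with noExternalLinks set in the configuration file, the crawler still ends up on other sites.
It turns out that the handling of 30x responses looks questionable too: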
/** parse a line of header (ans 30X) => just look for location
 * @return 0 if OK, 1 if we don't want to read the file
 */
int html::parseHeader30X () {
  if (posParse - area < 2) {
    // end of http headers without location => err40X
    errno = err40X;
    return 1;
  } else {
    if (startWithIgnoreCase("location: ", area)) {
      int i=10;
      while (area[i]!=' ' && area[i]!='\n' && area[i]!='\r'
             && notCgiChar(area[i])) {
        i++;
      }
      if (notCgiChar(area[i])) {
        area[i] = 0; // end of url
        // read the location (do not decrease depth)
        url *nouv = new url(area+10, here->getDepth(), base);
#ifdef URL_TAGS
        nouv->tag = here->tag;
#endif // URL_TAGS
        manageUrl(nouv, true);
        // we do not need more headers
      }
      errno = err30X;
      return 1;
    }
  }
  return 0;
}
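The thing to note here is the call manageUrl(nouv, true): if you step into it, you will find that when the second argument is true it does not check whether the URL belongs to the current site. In effect, noExternalLinks is ignored for redirects.
My requirement is that once noExternalLinks is configured, even a 30x redirect must not lead to another site, so I made the following change: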
int html::parseHeader30X () {
  if (posParse - area < 2) {
    // end of http headers without location => err40X
    errno = err40X;
    return 1;
  } else {
    if (startWithIgnoreCase("location: ", area)) {
      int i=10;
      while (area[i]!=' ' && area[i]!='\n' && area[i]!='\r'
             && notCgiChar(area[i])) {
        i++;
      }
      if (notCgiChar(area[i])) {
        area[i] = 0; // end of url
        // read the location (do not decrease depth)
        url *nouv = new url(area+10, here->getDepth(), base);
#ifdef URL_TAGS
        nouv->tag = here->tag;
#endif // URL_TAGS
        // only skip the same-site check when external links are allowed
        if (global::externalLinks) {
          manageUrl(nouv, true);
        } else {
          manageUrl(nouv, false);
        }
        // we do not need more headers
      }
      errno = err30X;
      return 1;
    }
  }
  return 0;
}
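The effect of the change can be modelled with a few lines. This is a sketch under an assumption: wouldQueue restates the host check as described above (skipped when external links are allowed or when the second argument is true), and all three function names are hypothetical; the real condition inside manageUrl() may contain more than this.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the host check: a URL is queued if external links
// are allowed, if the caller asked to skip the check, or if the host matches.
bool wouldQueue(const std::string& hereHost, const std::string& newHost,
                bool externalLinks, bool isRedir) {
    return externalLinks || isRedir || hereHost == newHost;
}

// Original call site: redirects always pass true as the second argument.
bool originalRedirect(const std::string& hereHost, const std::string& newHost,
                      bool externalLinks) {
    return wouldQueue(hereHost, newHost, externalLinks, true);
}

// Patched call site: the second argument follows global::externalLinks.
bool patchedRedirect(const std::string& hereHost, const std::string& newHost,
                     bool externalLinks) {
    return wouldQueue(hereHost, newHost, externalLinks, externalLinks);
}
```

One consequence worth keeping in mind: with external links disabled, the patched version also drops redirects between hosts of the same site, for example from example.com to www.example.com, because the comparison is on the host name.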

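Problem 5: when crawling a single site with a small number of connections, the speed sometimes drops to 0 for a while and then recovers.
The cron() function in main.cc contains this: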
  // see if we should read again urls in fifowait
  if ((global::now % 300) == 0) {
    global::readPriorityWait = global::URLsPriorityWait->getLength();
    global::readWait = global::URLsDiskWait->getLength();
  }
  if ((global::now % 300) == 150) {
    global::readPriorityWait = 0;
    global::readWait = 0;
  }
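As I read it, every 300 seconds it copies the current lengths of the URLsPriorityWait and URLsDiskWait queues into the two counters, and 150 seconds later it resets the counters to 0.
The code that takes URLs from the queues then does this: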
/* Get the next url
 * here is defined how priorities are handled
 */
static bool canGetUrl (bool *testPriority) {
  url *u;
  if (global::readPriorityWait) {
    global::readPriorityWait--;
    u = global::URLsPriorityWait->get();
    global::namedSiteList[u->hostHashCode()].putPriorityUrlWait(u);
    return true;
  } else if (*testPriority && (u=global::URLsPriority->tryGet()) != NULL) {
    // We've got one url (priority)
    global::namedSiteList[u->hostHashCode()].putPriorityUrl(u);
    return true;
  } else {
    *testPriority = false;
    // Try to get an ordinary url
    if (global::readWait) {
      global::readWait--;
      u = global::URLsDiskWait->get();
      global::namedSiteList[u->hostHashCode()].putUrlWait(u);
      return true;
    } else {
      u = global::URLsDisk->tryGet();
      if (u != NULL) {
        global::namedSiteList[u->hostHashCode()].putUrl(u);
        return true;
      } else {
        return false;
      }
    }
  }
}

I still don't understand the point of this: why use if (global::readWait) instead of checking directly in canGetUrl() whether URLsDiskWait is empty?
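Whatever the intent, the stall can be reproduced from the two cron branches alone. The sketch below is a simplified model, not larbin code: WaitState, cronTick and firstReadableTime are hypothetical names, it assumes cron() runs once per second, and it ignores the decrement done in canGetUrl().

```cpp
#include <cassert>

// Hypothetical model of one counter guarding one wait queue.
struct WaitState {
    long readWait = 0;  // how many waiting URLs may still be taken
};

// Mirrors the two cron branches for a single queue.
void cronTick(WaitState& s, long now, long queueLength) {
    if (now % 300 == 0)   s.readWait = queueLength;  // snapshot the length
    if (now % 300 == 150) s.readWait = 0;            // close the window
}

// A URL enters the (previously empty) wait queue at second `arrival`.
// Returns the first second at which the counter lets it be read,
// or -1 if that does not happen before `horizon`.
long firstReadableTime(long arrival, long horizon) {
    WaitState s;
    for (long now = 0; now < horizon; ++now) {
        long queueLength = (now >= arrival) ? 1 : 0;
        cronTick(s, now, queueLength);
        if (s.readWait > 0) return now;
    }
    return -1;
}
```

In this model a URL that reaches the wait queue at second 10 only becomes readable at second 300. If the wait queue is the only place that still holds URLs for the site, nothing is fetched in between, which would look like the speed dropping to 0 and then recovering.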

For now I can only change it as follows, to avoid the stop-and-resume behaviour:
  // see if we should read again urls in fifowait
  if ((global::now % 300) == 0) {
    global::readPriorityWait = global::URLsPriorityWait->getLength();
    global::readWait = global::URLsDiskWait->getLength();
  }
  // refill the counters as soon as they reach 0
  if (global::readWait == 0)
    global::readWait = global::URLsDiskWait->getLength();
  if (global::readPriorityWait == 0)
    global::readPriorityWait = global::URLsPriorityWait->getLength();
  /* periodic reset disabled
  if ((global::now % 300) == 150) {
    global::readPriorityWait = 0;
    global::readWait = 0;
  }*/