gethostbyname() and the related data handling

Source: Internet | Published by: 嘻哈知乎 | Editor: 程序博客网 | Date: 2024/04/29 10:05

gethostbyname() -- get an IP address from a domain name or host name

#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <arpa/inet.h>
#include <stdio.h>

struct hostent *gethostbyname(const char *name);

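The argument is a domain name or host name, for example "www.google.cn". The return value is a pointer to a hostent structure. If the call fails, the function returns NULL and sets h_errno (not errno) to indicate the reason.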
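To make the failure case concrete, here is a minimal sketch of a call that checks for NULL and reports the reason. The helper name lookup_official_name() is made up for this example and is not part of any library. The _DEFAULT_SOURCE line assumes glibc, where it makes h_errno and hstrerror() visible under strict compiler modes.

```c
#define _DEFAULT_SOURCE   /* glibc: make h_errno and hstrerror() visible */
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>

/*
 * lookup_official_name() is a hypothetical helper for illustration only.
 * It copies the official name of `name` into `out` (capacity `outlen`)
 * and returns 0 on success or -1 on failure.
 */
int lookup_official_name(const char *name, char *out, size_t outlen)
{
    struct hostent *h;
    size_t n;

    assert(name != NULL && out != NULL);

    h = gethostbyname(name);
    if (h == NULL) {
        /* Resolver errors are reported in h_errno, not errno. */
        fprintf(stderr, "gethostbyname(%s): %s\n", name, hstrerror(h_errno));
        return -1;
    }

    /* The result may live in static storage that a later call
       overwrites, so copy what we need right away. */
    n = strlen(h->h_name);
    if (n >= outlen)
        return -1;                /* caller's buffer is too small */
    memcpy(out, h->h_name, n + 1);
    return 0;
}
```

A caller would pass a buffer, for example char buf[256], and check the return value before using it. Copying the name out immediately is a deliberate choice: the hostent returned by gethostbyname() should be treated as valid only until the next resolver call.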

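struct hostent
{
    char  *h_name;        /* official (canonical) name of the host */
    char **h_aliases;     /* NULL-terminated list of aliases */
    int    h_addrtype;    /* address type: AF_INET or AF_INET6 */
    int    h_length;      /* length of each address, in bytes */
    char **h_addr_list;   /* NULL-terminated list of addresses */
#define h_addr h_addr_list[0]   /* first address in the list */
};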

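hostent->h_name

The official (canonical) name of the host. For example, the canonical name of www.google.com is actually www.l.google.com.

hostent->h_aliases

The aliases of the host, as a NULL-terminated array of strings. www.google.com is itself an alias of Google's host. A host may have several aliases; they are simply extra names given to a site so that users can remember it more easily.

hostent->h_addrtype

The type of the host's IP address: IPv4 (AF_INET) or IPv6 (AF_INET6).

hostent->h_length

The length of the host's IP address in bytes (4 for AF_INET, 16 for AF_INET6).

hostent->h_addr_list

The IP addresses of the host, as a NULL-terminated array. Note that each address is stored in binary form, in network byte order. Never print it directly with printf and %s: it is not a text string, so the output will be wrong. To print an address, convert it first with inet_ntop().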

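const char *inet_ntop(int af, const void *src, char *dst, socklen_t cnt);

This function converts the network address structure src, of address family af, into a human-readable text string and stores the result in the buffer dst, whose size is cnt bytes. It returns a pointer to dst. On error it returns NULL.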
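As a small illustration of what "binary form in network byte order" means, the sketch below converts four raw address bytes, laid out the way an h_addr_list entry is for AF_INET, into text. The helper name addr_bytes_to_text() is invented for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * addr_bytes_to_text() is a hypothetical helper for illustration only.
 * `bytes` holds an IPv4 address in network byte order (most significant
 * byte first). Returns dst on success, or NULL if inet_ntop() fails,
 * for example because the buffer is too small.
 */
const char *addr_bytes_to_text(const unsigned char bytes[4],
                               char *dst, socklen_t len)
{
    struct in_addr addr;

    assert(bytes != NULL && dst != NULL);

    /* Copy the raw bytes instead of casting the pointer, so the
       alignment of the source does not matter. */
    memcpy(&addr, bytes, sizeof addr);
    return inet_ntop(AF_INET, &addr, dst, len);
}
```

With the bytes {192, 0, 2, 1} and a buffer of INET_ADDRSTRLEN bytes, the result is the string "192.0.2.1". For IPv6 the same idea applies with struct in6_addr, AF_INET6 and a buffer of INET6_ADDRSTRLEN bytes.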

#include <stdio.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>   /* INET6_ADDRSTRLEN */
#include <arpa/inet.h>    /* inet_ntop() */

int main(int argc, char **argv)
{
    char *ptr, **pptr;
    struct hostent *hptr;
    char str[INET6_ADDRSTRLEN];   /* large enough for IPv4 or IPv6 text */

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <hostname>\n", argv[0]);
        return 1;
    }
    ptr = argv[1];

    if ((hptr = gethostbyname(ptr)) == NULL)
    {
        printf(" gethostbyname error for host:%s\n", ptr);
        return 1;
    }

    printf("official hostname:%s\n", hptr->h_name);

    /* h_aliases is a NULL-terminated array of alternative names */
    for (pptr = hptr->h_aliases; *pptr != NULL; pptr++)
        printf(" alias:%s\n", *pptr);

    switch (hptr->h_addrtype)
    {
    case AF_INET:
    case AF_INET6:
        /* convert each binary address to text before printing it */
        for (pptr = hptr->h_addr_list; *pptr != NULL; pptr++)
            printf(" address:%s\n",
                   inet_ntop(hptr->h_addrtype, *pptr, str, sizeof(str)));
        printf(" first address: %s\n",
               inet_ntop(hptr->h_addrtype, hptr->h_addr, str, sizeof(str)));
        break;
    default:
        printf("unknown address type\n");
        break;
    }

    return 0;
}

Compile and run

-----------------------------

# gcc test.c

# ./a.out www.baidu.com

official hostname:www.a.shifen.com

alias:www.baidu.com

address:121.14.88.11

address:121.14.89.11

first address: 121.14.88.11

Note:

On Unix/Linux, gethostbyname is commonly used to ask DNS for the IP address of a domain name. Because DNS resolution is recursive, a gethostbyname call can take a very long time to return. Unlike connect or read, it cannot be given a timeout through setsockopt or select, so it often becomes the bottleneck of a program. One suggested workaround is to arm a timer signal with alarm and, if the timer fires, use setjmp and longjmp to jump out of gethostbyname (I have not tried this approach and do not know how well it works).

In a multithreaded program gethostbyname has an even more serious problem: if one thread blocks inside gethostbyname, every other thread that calls gethostbyname blocks there as well. I ran into this while writing a crawler, and it puzzled me for a long time. All the crawler threads were blocked in gethostbyname, which made the crawler very slow. I searched online for a long time without finding an answer. Today I happened to find an e-book in our lab's Google Group, "Mining the Web - Discovering Knowledge from Hypertext Data", whose discussion of crawlers contains the following passages:

Many clients for DNS resolution are coded poorly. Most UNIX systems provide an implementation of gethostbyname (the DNS client API: application program interface), which cannot concurrently handle multiple outstanding requests. Therefore, the crawler cannot issue many resolution requests together and poll at a later time for completion of individual requests, which is critical for acceptable performance. Furthermore, if the system-provided client is used, there is no way to distribute load among a number of DNS servers. For all these reasons, many crawlers choose to include their own custom client for DNS name resolution. The Mercator crawler from Compaq System Research Center reduced the time spent in DNS from as high as 87% to a modest 25% by implementing a custom client. The ADNS asynchronous DNS client library is ideal for use in crawlers.

In spite of these optimizations, a large-scale crawler will spend a substantial fraction of its network time not waiting for Http data transfer, but for address resolution. For every hostname that has not been resolved before (which happens frequently with crawlers), the local DNS may have to go across many network hops to fill its cache for the first time. To overlap this unavoidable delay with useful work, prefetching can be used. When a page that has just been fetched is parsed, a stream of HREFs is extracted. Right at this time, that is, even before any of the corresponding URLs are fetched, hostnames are extracted from the HREF targets, and DNS resolution requests are made to the caching server. The prefetching client is usually implemented using UDP instead of TCP, and it does not wait for resolution to be completed. The request serves only to fill the DNS cache so that resolution will be fast when the page is actually needed later on.

In short, the Unix gethostbyname cannot handle concurrent use; this is a built-in limitation of the interface and cannot be changed. Large crawlers usually avoid gethostbyname and implement their own custom DNS client instead. That makes it possible to balance load across DNS servers, and asynchronous resolution greatly speeds up lookups. Such a client is usually built on UDP and can resolve the IP address of a URL before the crawler actually fetches the page. The book also mentions an open-source asynchronous DNS library, adns, whose home page is http://www.chiark.greenend.org.uk/~ian/adns/

From all of this it is clear that gethostbyname is not suitable for multithreaded programs, or for any other program that needs fast DNS resolution.
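For reference, the alarm plus setjmp/longjmp workaround mentioned above could look roughly like the sketch below. This is only an illustration under stated assumptions, not a recommendation. The names run_with_timeout(), gethostbyname_timeout() and sleep_for() are invented for this example. It uses sigsetjmp/siglongjmp, the signal-aware variants of setjmp/longjmp, and it assumes a single-threaded program, since alarm is process-wide. gethostbyname is not async-signal-safe, so jumping out of it may leave the resolver in an undefined state.

```c
#define _POSIX_C_SOURCE 200809L   /* sigaction(), sigsetjmp() */
#include <assert.h>
#include <netdb.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

static sigjmp_buf timeout_env;    /* where to resume when the alarm fires */

static void on_alarm(int signo)
{
    (void)signo;
    siglongjmp(timeout_env, 1);   /* abandon the blocked call */
}

/* Run fn(arg); return 0 if it finished, -1 if `seconds` elapsed first. */
int run_with_timeout(void (*fn)(void *), void *arg, unsigned seconds)
{
    struct sigaction sa, old_sa;
    int status;

    assert(fn != NULL && seconds > 0);

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGALRM, &sa, &old_sa);

    if (sigsetjmp(timeout_env, 1) == 0) {
        alarm(seconds);           /* arm the timer */
        fn(arg);                  /* may block */
        alarm(0);                 /* finished in time: disarm */
        status = 0;
    } else {
        status = -1;              /* reached via siglongjmp: timed out */
    }

    sigaction(SIGALRM, &old_sa, NULL);   /* restore the previous handler */
    return status;
}

struct lookup { const char *name; struct hostent *result; };

static void do_lookup(void *arg)
{
    struct lookup *l = arg;
    l->result = gethostbyname(l->name);
}

/* Returns NULL on lookup failure or on timeout. */
struct hostent *gethostbyname_timeout(const char *name, unsigned seconds)
{
    struct lookup l = { name, NULL };

    if (run_with_timeout(do_lookup, &l, seconds) != 0)
        return NULL;
    return l.result;
}

/* Stand-in for a slow lookup, used only to exercise the timeout path. */
void sleep_for(void *arg)
{
    sleep(*(unsigned *)arg);
}
```

A call such as gethostbyname_timeout("www.baidu.com", 3) would give up after about three seconds. Because the timeout is delivered by a signal, this does nothing for the multithreaded case described above; there, an asynchronous resolver such as adns remains the approach the book suggests.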

0 0