Web Crawler Series 16: The urlparse Module

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 02:43
res = urlparse.urlparse(url, scheme, allow_fragments)
Returns a 6-tuple of type ParseResult(scheme, netloc, path, params, query, fragment).
Besides the tuple fields, a ParseResult has several commonly used attributes and one method:
    res.username
    res.password
    res.hostname
    res.port
    res.geturl()

urlparse.urldefrag(url)
Returns a 2-tuple: the URL without its fragment, and the fragment.

urlparse.urlunparse(data)
Returns a string. data must be a six-item iterable.

res = urlparse.urlsplit(url, scheme, allow_fragments)
Returns a 5-tuple of type SplitResult(scheme, netloc, path, query, fragment).
The path here is equivalent to urlparse's path plus params (joined by ";"); see the example.

urlparse.urlunsplit(data)
Returns a string. data must be a five-item iterable.

urlparse.urljoin(base, url, allow_fragments)
This function is more complicated. Different inputs give very different results (it matters whether base ends in a file name, and whether url starts with "/" or contains ".."), so it is easy to get wrong. I do not recommend using it; see the last four examples below.

    import urlparse

    url = "https://www.google.com.hk:8080/home/search;12432?newwi.1.9.serpuc#1234"

    r = urlparse.urlparse(url)
    print r
    # ParseResult(scheme='https', netloc='www.google.com.hk:8080', path='/home/search', params='12432', query='newwi.1.9.serpuc', fragment='1234')
    print r.port, r.hostname
    # 8080 www.google.com.hk
    print r.geturl()
    # https://www.google.com.hk:8080/home/search;12432?newwi.1.9.serpuc#1234

    # urlsplit keeps the params inside the path
    r = urlparse.urlsplit(url)
    print r
    # SplitResult(scheme='https', netloc='www.google.com.hk:8080', path='/home/search;12432', query='newwi.1.9.serpuc', fragment='1234')

    parts = ["http", "www.facebook.com", "/home/email", "132", "parts", "md5=?"]
    print urlparse.urlunparse(parts)
    # http://www.facebook.com/home/email;132?parts#md5=?
    print urlparse.urlunsplit(parts[0:5])
    # http://www.facebook.com/home/email?132#parts

    # base has no trailing slash, so "home" is replaced
    base = "http://baidu.com/home"
    url = "index.html"
    print urlparse.urljoin(base, url)
    # http://baidu.com/index.html

    # the last path segment of base is replaced
    base = "http://baidu.com/home/action.jsp"
    url = "index.html"
    print urlparse.urljoin(base, url)
    # http://baidu.com/home/index.html

    # a leading "/" replaces the whole path of base
    base = "http://baidu.com/home/action.jsp"
    url = "/index.html"
    print urlparse.urljoin(base, url)
    # http://baidu.com/index.html

    # more ".." segments than base has directories
    base = "http://baidu.com/home/action.jsp"
    url = "../../index.html"
    print urlparse.urljoin(base, url)
    # http://baidu.com/../index.html
    # (Python 2 result; Python 3.5+ follows RFC 3986 and gives http://baidu.com/index.html)
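The examples above do not exercise urldefrag or the username and password attributes, so here is a minimal sketch of both. The URL is made up for illustration, and the try/except import is an assumption added so the snippet runs on Python 3 as well, where the same functions live in urllib.parse.

```python
try:
    from urllib.parse import urlparse, urldefrag   # Python 3 location
except ImportError:
    from urlparse import urlparse, urldefrag       # Python 2 module discussed here

# Hypothetical URL carrying credentials, a port, and a fragment.
url = "http://user:secret@example.com:8080/docs/page.html?lang=en#intro"

r = urlparse(url)
username, password = r.username, r.password   # taken from the part of netloc before "@"
hostname, port = r.hostname, r.port           # port is an int

# urldefrag returns a 2-item result: (URL without the fragment, fragment).
result = urldefrag(url)
bare_url, fragment = result[0], result[1]

print("%s %s %s %s" % (username, password, hostname, port))
print(bare_url)
print(fragment)
```

Indexing the urldefrag result by position keeps the snippet valid on both versions, since Python 2 returns a plain tuple there.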
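Much of the surprise with urljoin comes from whether base ends with a slash. The sketch below shows that difference and one possible way to make joining predictable. join_under is a hypothetical helper written for this illustration, not part of the standard library, and it does not guard against ".." segments or absolute URLs passed as the second argument.

```python
try:
    from urllib.parse import urljoin   # Python 3 location
except ImportError:
    from urlparse import urljoin       # Python 2 module discussed here


def join_under(base, relative):
    """Hypothetical helper: treat base as a directory and resolve relative beneath it."""
    if not base.endswith("/"):
        base += "/"                    # without the slash, the last segment would be replaced
    return urljoin(base, relative.lstrip("/"))   # a leading "/" would reset the path


# Without a trailing slash the last segment of base is replaced.
without_slash = urljoin("http://baidu.com/home", "index.html")
# With a trailing slash the relative URL is appended.
with_slash = urljoin("http://baidu.com/home/", "index.html")
# The helper gives the "append" behaviour regardless of how the inputs are written.
helper_result = join_under("http://baidu.com/home", "/index.html")

print(without_slash)
print(with_slash)
print(helper_result)
```

Normalising the inputs before calling urljoin is only one design choice; plain string concatenation is another, at the cost of losing urljoin's handling of relative references.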