Extract HTML Title, Description, Keywords (Chilkat/Python Study Notes, Part 2)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 04:18

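Since I have set out to learn Chilkat, I will keep writing about it.

Right, let's begin!
To follow this post you need to know Python syntax. Python is simple, yet it can do far-from-simple things, which is why I am learning it. You also need to have Chilkat installed; for the details, see my earlier post:

Getting Started Spidering a Site: a practice crawler using Chilkat (Python) (from: http://www.example-code.com)

http://blog.csdn.net/Xiao_Qiang_/archive/2008/08/23/2820293.aspx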

1. Source code

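# "extra" appears to be the author's local package; adjust the import
# to wherever chilkat.py lives on your machine.
from extra import chilkat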
# The Chilkat Spider component/library is free.
spider = chilkat.CkSpider()

# The spider object crawls a single web site at a time.  As you'll see
# in later examples, you can collect outbound links and use them to
# crawl the web.  For now, we'll simply spider 10 pages of www.vtchina.com
spider.Initialize("http://www.vtchina.com/")

# Add the 1st URL:
spider.AddUnspidered("http://www.vtchina.com/")


# Begin crawling the site by calling CrawlNext repeatedly.

for i in range(0,10):

    success = spider.CrawlNext()
    if success:
        # Show the URL of the page just spidered.
        print spider.lastUrl()

        # The HTML META keywords, title, and description are available in these properties:
        print spider.lastHtmlTitle()
        
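        # The description comes back as a UTF-8 byte string; decode it
        # to unicode so Chinese text prints correctly (Python 2).
        info = spider.lastHtmlDescription()
        HtmlDescription = unicode(info, "utf-8")
        print HtmlDescription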
        print spider.lastHtmlKeywords()

        # The HTML is available in the LastHtml property
    else:
        # Did we get an error or are there no more URLs to crawl?
        if (spider.get_NumUnspidered() == 0):
            print "No more URLs to spider"
        else:
            print spider.lastErrorText()

    # Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)
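
To make concrete what the three values printed above correspond to in a page, here is a minimal standard-library sketch that pulls the same three fields out of an HTML string. It is not how Chilkat does it internally. It is written for Python 3 (the listing above is Python 2), and MetaExtractor, extract_meta and the SAMPLE markup are names made up for this illustration.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Meta names are matched case-insensitively.
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that sits inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


def extract_meta(html):
    """Return (title, description, keywords) for an HTML string."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return (
        parser.title.strip(),
        parser.meta.get("description", ""),
        parser.meta.get("keywords", ""),
    )


# Hypothetical sample page used only for this illustration.
SAMPLE = """<html><head>
<title>Example page</title>
<meta name="description" content="A short summary.">
<meta name="Keywords" content="spider, crawl">
</head><body></body></html>"""

title, description, keywords = extract_meta(SAMPLE)
```

A missing tag simply yields an empty string, which is convenient when looping over many pages.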

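Note that I am crawling http://www.vtchina.com/ here, which is a Chinese-language site. When the program runs, the statement print spider.lastHtmlTitle() prints garbled text. To fix this, go to the directory from which the Chilkat Python module is loaded. First rename chilkat.cpy (presumably the compiled chilkat.pyc) to anything other than chilkat, so that Python loads chilkat.py instead of that file. Then edit chilkat.py: find the function def lastHtmlTitle(*args): and change it to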

    def lastHtmlTitle(*args):
        utfchar = _chilkat.CkSpider_lastHtmlTitle(*args)
        info = unicode(utfchar,"utf-8")
        return info

With this change the output is no longer garbled.
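
If you would rather not edit chilkat.py, the same decoding can be done in your own code. The sketch below assumes, as the fix above does, that the spider returns UTF-8 encoded byte strings. to_text and FakeSpider are hypothetical names, and FakeSpider merely stands in for chilkat.CkSpider so the example runs without Chilkat installed.

```python
def to_text(raw, encoding="utf-8"):
    """Decode a byte string to text; pass through values that are already text."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


class FakeSpider(object):
    """Hypothetical stand-in for chilkat.CkSpider; returns UTF-8 bytes."""

    def lastHtmlTitle(self):
        # Two Chinese characters, written as escapes to keep this file ASCII.
        return u"\u4e2d\u6587".encode("utf-8")


spider = FakeSpider()
raw_title = spider.lastHtmlTitle()

# Decoding with the right codec recovers the original characters.
title = to_text(raw_title)

# Decoding the same bytes with the wrong codec is what produces garbled output.
garbled = raw_title.decode("latin-1")
```

With a real spider you would call to_text(spider.lastHtmlTitle()), and likewise for lastHtmlDescription() and lastHtmlKeywords(), which leaves the library files untouched.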

Since this is a very introductory example, there is not much to say about the code itself; it simply fetches each page's title.
