A Few Things About On-Site Search


    • Preface
    • Modularization
      • Login Module
      • Blog Scanning Module
      • Blog Details Module
      • Search Module
    • Demo
      • Case 1
      • Case 2
    • Summary

Preface

I learned a little about full-text search a while back, using Java together with the Lucene and compass frameworks. If you're interested, see this column:
http://blog.csdn.net/column/details/lucene-compass.html

These days I'm working in Python, so the toolkit needs an update. A quick search turns up quite a few options, pylucene among them, but whoosh compares favorably, so that's what I'll use today.

Installing it is straightforward:

pip install whoosh

That's all it takes.
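If you want to make sure the install works, here is a minimal round trip based on the whoosh quickstart (the testindex directory name is just an arbitrary choice for this check):

import os

from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# a throwaway schema and index directory, only to verify whoosh is importable and working
schema = Schema(path=ID(stored=True), body=TEXT)
if not os.path.exists("testindex"):
    os.mkdir("testindex")
ix = create_in("testindex", schema)

writer = ix.writer()
writer.add_document(path=u"/hello", body=u"hello whoosh full text search")
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse(u"whoosh")
    for hit in searcher.search(query):
        print(hit["path"])   # expected output: /hello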

Goal: build an "on-site search" for my own blog, to patch over some of the weaknesses of CSDN's built-in site search.

Modularization

Lately I've grown fond of breaking a task into modules: each piece of functionality is easier to manage on its own, integration testing is more convenient when everything is put together, and adding features or refactoring later becomes painless.

For the requirement above I designed a few small modules, explained one by one below.

Login Module

The login module is pretty much mandatory, because fetching a post's detailed content requires an already logged-in session; without one, the data simply can't be retrieved.

I've written before about simulating a CSDN login. The features completed back then were:

  • Simulated login
  • Upvoting / downvoting articles
  • Posting comments
  • Fetching blogger details

To keep people with bad intentions from misusing the code, I didn't post it at the time. For the technical details, feel free to message me or leave a comment below.

Here, then, is the simulated-login code.

import requests
from bs4 import BeautifulSoup


class Login(object):
    """
    Get the same session for blog's backing up. Need the special username and password of your account.
    """

    def __init__(self):
        # the common headers for this login operation.
        self.headers = {
            'Host': 'passport.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def login(self, username, password):
        if username and password:
            self.username = username
            self.password = password
        else:
            raise Exception('Need Your username and password!')
        loginurl = 'https://passport.csdn.net/account/login'
        # get the 'token' for the login webflow
        self.session = requests.Session()
        response = self.session.get(url=loginurl, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Assemble the data posted when logging in.
        self.token = soup.find('input', {'name': 'lt'})['value']
        payload = {
            'username': self.username,
            'password': self.password,
            'lt': self.token,
            'execution': soup.find('input', {'name': 'execution'})['value'],
            '_eventId': 'submit'
        }
        response = self.session.post(url=loginurl, data=payload, headers=self.headers)
        # return the logged-in session on success
        return self.session if response.status_code == 200 else None

Blog Scanning Module

The blog scanning module needs no logged-in session. Its job is to find the total number of the blogger's articles and the URL of every post; those URLs are used next to fetch each post's details.

import re

import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """
    Scan for all blogs
    """

    def __init__(self, domain):
        self.username = domain
        self.rooturl = 'http://blog.csdn.net'
        self.bloglinks = []
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def scan(self):
        # get the page count
        response = requests.get(url=self.rooturl + "/" + self.username, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        pagecontainer = soup.find('div', {'class': 'pagelist'})
        pages = re.findall(re.compile(r'(\d+)'), pagecontainer.find('span').get_text())[-1]
        # construct the blog list pages, e.g.: http://blog.csdn.net/Marksinoberg/article/list/2
        for index in range(1, int(pages) + 1):
            # get the blog links of each list page
            listurl = 'http://blog.csdn.net/{}/article/list/{}'.format(self.username, str(index))
            response = requests.get(url=listurl, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                alinks = soup.find_all('span', {'class': 'link_title'})
                # print(alinks)
                for alink in alinks:
                    link = alink.find('a').attrs['href']
                    link = self.rooturl + link
                    self.bloglinks.append(link)
            except Exception as e:
                print('Something unexpected happened!\n{}'.format(e))
                continue
        return self.bloglinks

Blog Details Module

As for the blog details, CSDN actually does a decent job here, and the response comes back as JSON. Without further ado, here is the detailed blog content that can be fetched while logged in.
(Figure: the blog-details JSON response)

Now the approach is clear: grab the title, URL, tags, summary description, and the article body. The code is as follows:

import json


class BlogDetails(object):
    """
    Get the special url for getting the markdown file.
    'url': blog URL
    'title': blog title
    'tags': blog tags
    'description': blog summary / description
    'content': blog Markdown source
    """

    def __init__(self, session, blogurl):
        self.headers = {
            'Referer': 'http://write.blog.csdn.net/mdeditor',
            'Host': 'passport.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        # construct the url: get the article id and the username
        # http://blog.csdn.net/marksinoberg/article/details/70432419
        username, id = blogurl.split('/')[3], blogurl.split('/')[-1]
        self.blogurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
        self.session = session

    def getSource(self):
        # get title and content for the assigned url.
        try:
            tempheaders = self.headers
            tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
            tempheaders['Host'] = 'write.blog.csdn.net'
            tempheaders['X-Requested-With'] = 'XMLHttpRequest'
            response = self.session.get(url=self.blogurl, headers=tempheaders)
            soup = json.loads(response.text)
            return {
                'url': soup['data']['url'],
                'title': soup['data']['title'],
                'tags': soup['data']['tags'],
                'description': soup['data']['description'],
                'content': soup['data']['markdowncontent'],
            }
        except Exception as e:
            print("Blog-details request failed! Details: {}".format(e))

Search Module

The search module is the core of today's post. The library is whoosh, and it is a genuinely pleasant one: well documented, simple, and easy to understand. If my broken English can get through the docs, yours certainly can.

Here's the link: http://whoosh.readthedocs.io/en/latest/

The default text analyzer targets English, so to handle Chinese properly we need Chinese word segmentation. I copied a tokenizer from the web, though its results are not great.

import os

import jieba
from whoosh.analysis import Token, Tokenizer
from whoosh.compat import text_type
from whoosh.fields import ID, KEYWORD, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser


class ChineseTokenizer(Tokenizer):
    def __call__(self, value, positions=False, chars=False, keeporiginal=False,
                 removestops=True, start_pos=0, start_char=0, mode='', **kwargs):
        assert isinstance(value, text_type), "%r is not unicode" % value
        t = Token(positions=positions, chars=chars, removestops=removestops, mode=mode, **kwargs)
        # use jieba to segment the Chinese text
        seglist = jieba.cut(value, cut_all=False)
        for w in seglist:
            t.original = t.text = w
            t.boost = 1.0
            if positions:
                t.pos = start_pos + value.find(w)
            if chars:
                t.startchar = start_char + value.find(w)
                t.endchar = start_char + value.find(w) + len(w)
            yield t


def ChineseAnalyzer():
    return ChineseTokenizer()


class Searcher(object):
    """
    Firstly: define a schema suitable for this system. It may need to be hard-coded.
            'url': blog URL
            'title': blog title
            'tags': blog tags
            'description': blog summary / description
            'content': blog Markdown source
    Secondly: add documents (blogs).
    Thirdly: search the user's query string and return the high-scoring blogs.
    """

    def __init__(self):
        # define a suitable schema
        self.schema = Schema(url=ID(stored=True),
                             title=TEXT(stored=True),
                             tags=KEYWORD(commas=True),
                             description=TEXT(stored=True),
                             content=TEXT(analyzer=ChineseAnalyzer()))
        # prepare a directory to store the index files
        if not os.path.exists("indexdir"):
            os.mkdir("indexdir")
        self.indexdir = "indexdir"
        self.indexer = create_in(self.indexdir, schema=self.schema)

    def addblog(self, blog):
        writer = self.indexer.writer()
        # write the blog details into the index
        writer.add_document(url=blog['url'],
                            title=blog['title'],
                            tags=blog['tags'],
                            description=blog['description'],
                            content=blog['content'])
        writer.commit()

    def search(self, querystring):
        # make sure the query string is a unicode string.
        # querystring = u'{}'.format(querystring)
        with self.indexer.searcher() as searcher:
            query = QueryParser('content', self.schema).parse(querystring)
            results = searcher.search(query)
            for item in results:
                print(item)
            # copy the stored fields out before the with-block closes the searcher
            return [dict(hit) for hit in results]

Demo

Alright, that's about it. Let's see how it runs.

Case 1

First, a search for the keyword DBHelper. Indexing every article would be slow, so I only crawl the first few posts.

# coding: utf8
# @Author: 郭 璞
# @File: TestAll.py
# @Time: 2017/5/12
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description:

from whooshlearn.csdn import Login, BlogScanner, BlogDetails, Searcher

login = Login()
session = login.login(username="Username", password="password")
print(session)

scanner = BlogScanner(domain="Marksinoberg")
blogs = scanner.scan()
print(blogs[0:3])

blogdetails = BlogDetails(session=session, blogurl=blogs[0])
blog = blogdetails.getSource()
print(blog['url'])
print(blog['description'])
print(blog['tags'])

# test whoosh for searcher
searcher = Searcher()
counter = 1
for item in blogs[0:7]:
    print("Processing article {}".format(counter))
    counter += 1
    details = BlogDetails(session=session, blogurl=item).getSource()
    searcher.addblog(details)
# searcher.addblog(blog)
searcher.search('DbHelper')
# searcher.search('Python')

The result of running the code:
(Figure: search results for the DBHelper keyword)

As it happens, only my first two posts are about DBHelper, and exactly those two documents were hit. Looks good.

Case 2

Now let's try another keyword, say Python.

# coding: utf8
# @Author: 郭 璞
# @File: TestAll.py
# @Time: 2017/5/12
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description:

from whooshlearn.csdn import Login, BlogScanner, BlogDetails, Searcher

login = Login()
session = login.login(username="username", password="password")
print(session)

scanner = BlogScanner(domain="Marksinoberg")
blogs = scanner.scan()
print(blogs[0:3])

blogdetails = BlogDetails(session=session, blogurl=blogs[0])
blog = blogdetails.getSource()
print(blog['url'])
print(blog['description'])
print(blog['tags'])

# test whoosh for searcher
searcher = Searcher()
counter = 1
for item in blogs[0:10]:
    print("Processing article {}".format(counter))
    counter += 1
    details = BlogDetails(session=session, blogurl=item).getSource()
    searcher.addblog(details)
# searcher.addblog(blog)
# searcher.search('DbHelper')
searcher.search('Python')

And here is the result of that run.
(Figure: search results for the Python keyword)

Four records were hit, which is an acceptable hit rate.

Summary

To wrap up: for whoosh-based on-site search, matching text with higher precision still leaves plenty of room for optimization, and QueryParser in particular has a lot more to offer than what I used here.
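As a sketch of that direction (not something exercised in this post), the single-field QueryParser inside Searcher.search() could be swapped for a MultifieldParser, with OrGroup so that documents matching only part of the query still score:

from whoosh.qparser import MultifieldParser, OrGroup

def search(self, querystring):
    # sketch of a broader Searcher.search(): query title, description and content
    # at once, OR-ing the terms so partial matches still get scored.
    with self.indexer.searcher() as searcher:
        parser = MultifieldParser(["title", "description", "content"],
                                  schema=self.schema, group=OrGroup)
        results = searcher.search(parser.parse(querystring), limit=10)
        for hit in results:
            print(hit)
        return [dict(hit) for hit in results]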

Highlighting the matched results is also quite convenient; the official docs describe it in detail.
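A rough sketch of how that could look inside Searcher.search(), right after results = searcher.search(query). Note that content is not stored in the schema above, so the original markdown has to be supplied; the doc_texts lookup (a dict mapping url to markdown source) is hypothetical:

# tune the default ContextFragmenter before pulling excerpts
results.fragmenter.maxchars = 120   # length of each excerpt
results.fragmenter.surround = 30    # context characters around each matched term

for hit in results:
    # 'doc_texts' is a hypothetical url -> markdown dict kept by the caller
    excerpt = hit.highlights("content", text=doc_texts[hit["url"]], top=2)
    print(hit["title"], excerpt)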

The last issue is Chinese: so far I haven't found a good way to improve the segmentation or the hit rate.
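If I do revisit it, one small, untested tweak would be switching the tokenizer above from jieba.cut(value, cut_all=False) to jieba's search-engine mode, which emits extra overlapping short terms and usually helps recall, though I haven't measured the effect here:

import jieba

# untested idea for ChineseTokenizer.__call__: search-engine mode also yields
# shorter overlapping terms in addition to the full words
seglist = jieba.cut_for_search(value)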
