PHP + MongoDB + Coreseek/Sphinx (xmlpipe2 data source): building a search engine for tens of millions of records

Source: the Internet. Posted by: php input获取数据. Editor: 程序博客网. Time: 2024/06/07 05:20

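In recent years the Linux + Nginx + PHP + MongoDB (LNPM) stack has become increasingly popular, and it even shows signs of replacing the Linux + Nginx/Apache + PHP + MySQL stack. The reason is that MongoDB is powerful, flexible, easy to scale and, above all, easy to use. MongoDB does not require a table schema to be designed up front: you can insert documents of any shape, and administration is convenient. That has made it the database of choice for startup teams and a rising star in mobile Internet development.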

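MongoDB does share a number of traits with relational databases, though. One of them is that its full-text index does not support Chinese. Full-text indexing has been enabled by default since MongoDB 2.6, but as usual it does not handle Chinese, so if you need search you have to find another way.
Sphinx and Lucene are both good choices for building a search engine. In my view Lucene has better support for Java and Sphinx has better support for PHP, so I chose Sphinx. Sphinx's own support for Chinese is not great either: it splits words on spaces (which suits English) and is not suitable for Chinese word segmentation at all. Fortunately there are two Sphinx-based packages that add Chinese support: Coreseek and Sphinx-for-chinese.
Coreseek has complete documentation and currently supports the latest version of Sphinx, so I chose Coreseek.
Sphinx-for-chinese is seriously short on documentation.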
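To make the word-splitting problem concrete, here is a minimal PHP sketch. It is not how Sphinx tokenizes internally; the `whitespaceTokens` helper is a made-up stand-in for any splitter that relies on spaces, and it only illustrates why such a splitter finds no word boundaries in Chinese text.

```php
<?php
// Hypothetical helper: a naive whitespace tokenizer, standing in for a
// space-based word splitter. This is a simplification, not Sphinx's code.
function whitespaceTokens(string $text): array
{
    // PREG_SPLIT_NO_EMPTY drops empty strings caused by leading/trailing spaces.
    return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$english = whitespaceTokens('full text search engine');
$chinese = whitespaceTokens('全文搜索引擎');

echo count($english), "\n"; // 4: one token per English word
echo count($chinese), "\n"; // 1: the whole phrase stays a single token
```

Because the Chinese phrase comes out as one token, a query for a word inside it would not match, which is the gap a segmenter such as the one in Coreseek is meant to fill.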

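Installation:
Coreseek installation.
Sphinx-for-chinese installation.
Creating the index:
Coreseek can connect to MySQL directly: fill in the MySQL connection details in the Coreseek configuration file and Coreseek reads the MySQL data itself to build the index (provided you have configured index generation or run the indexing command). Sphinx cannot connect to MongoDB directly, however, so the Mongo data has to be exposed either as a Python data source or as an xmlpipe2 data source.
I don't know Python, so I wrote an XML pipe in PHP to feed the MongoDB data to Coreseek.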

<?php

class SphinxXmlpipe {

    private $xmlWriter;
    private $fields = array();
    private $attributes = array();
    private $documents = array();

    public function setFields($fields) {
        $this->fields = $fields;
    }

    public function setAttributes($attributes) {
        $this->attributes = $attributes;
    }

    public function beginOutput() {
        // Create a new XML document in memory
        $this->xmlWriter = new \XMLWriter();
        $this->xmlWriter->openMemory();
        $this->xmlWriter->setIndent(true);
        $this->xmlWriter->startDocument('1.0', 'UTF-8');

        $this->xmlWriter->startElement('sphinx:docset');
        $this->xmlWriter->startElement('sphinx:schema');

        // Add the full-text fields to the schema
        foreach ($this->fields as $field) {
            $this->xmlWriter->startElement('sphinx:field');
            $this->xmlWriter->writeAttribute('name', $field);
            $this->xmlWriter->endElement();
        }

        /*
        // Add the attributes to the schema
        foreach ($this->attributes as $attributes) {
            $this->xmlWriter->startElement('sphinx:attr');
            foreach ($attributes as $key => $value) {
                $this->xmlWriter->writeAttribute($key, $value);
            }
            $this->xmlWriter->endElement();
        }
        */
        $this->xmlWriter->endElement(); // schema
    }

    public function addDocument($doc) {
        $this->xmlWriter->startElement('sphinx:document');
        // The document id must be a unique positive integer
        $this->xmlWriter->writeAttribute('id', (string) $doc['book_id']);

        // Write every field of the record as a child element
        foreach ($doc as $key => $value) {
            $this->xmlWriter->startElement($key);
            $this->xmlWriter->text((string) $value);
            $this->xmlWriter->endElement();
        }

        $this->xmlWriter->endElement(); // document
    }

    public function endOutput() {
        // End sphinx:docset and print the whole buffer
        $this->xmlWriter->endElement();
        $this->xmlWriter->endDocument();
        echo $this->xmlWriter->outputMemory();
    }

    public function xmlpipe2() {
        $this->setFields(array(
            'book_id',
            'book_name',
        ));

        $this->setAttributes(array(
            array(
                'name' => 'book_id',
                'type' => 'int',
                'bits' => '16',
                'default' => '1',
            ),
        ));

        $this->beginOutput();

        // D() and C() are the framework's model and config helpers (ThinkPHP-style)
        $mBook = D('book');
        $count = (int) $mBook->count();
        $limit = max(1, (int) C('XMLPIPE_BOOKS_COUNT_PER_TIME'));

        // Read the collection in batches. The offset must advance on every pass;
        // the original loop called limit($limit) each time and so re-read the first batch.
        for ($offset = 0; $offset < $count; $offset += $limit) {
            $books = $mBook->field('book_id,book_name', '_id=>0')->limit($offset . ',' . $limit)->select();
            if (!$books) {
                break; // nothing more to read
            }
            foreach ($books as $book) {
                $this->addDocument($book);
            }
            unset($books);
        }

        $this->endOutput();
    }

}
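The class above keeps the entire XML document in memory until endOutput() runs, which can get heavy with tens of millions of records. One possible variation, sketched below under stated assumptions, is to flush the XMLWriter buffer after each batch. The `streamDocset` function, its `$batches` input and the `$emit` callback are hypothetical names for illustration, and the schema is left out on the assumption that the fields are declared in the Coreseek configuration instead.

```php
<?php
// Hypothetical helper: writes documents batch by batch and hands the buffered
// XML to $emit after each batch, so memory use stays bounded by the batch size.
function streamDocset(iterable $batches, callable $emit): void
{
    $w = new XMLWriter();
    $w->openMemory();
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('sphinx:docset');

    foreach ($batches as $batch) {
        foreach ($batch as $doc) {
            $w->startElement('sphinx:document');
            $w->writeAttribute('id', (string) $doc['book_id']);
            $w->writeElement('book_name', (string) $doc['book_name']);
            $w->endElement(); // sphinx:document
        }
        // outputMemory(true) returns the buffered XML and empties the buffer.
        $emit($w->outputMemory(true));
    }

    $w->endElement(); // sphinx:docset
    $w->endDocument();
    $emit($w->outputMemory(true));
}

// Usage with two made-up batches; in the pipe itself $emit would simply echo.
$out = '';
$batches = array(
    array(array('book_id' => 1, 'book_name' => 'A')),
    array(array('book_id' => 2, 'book_name' => 'B')),
);
streamDocset($batches, function ($chunk) use (&$out) {
    $out .= $chunk;
});
```

In a real pipe the batches could come from a generator that pages through the collection, so only one batch of records and one batch of XML exist at a time.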

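The XML output format looks like this. Note that the sample below uses the simpler xmlpipe layout from the Sphinx documentation; the class above wraps each record in a sphinx:document element inside sphinx:docset and carries the id as an attribute.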

<document>
<id>123</id>
<group>45</group>
<timestamp>1132223498</timestamp>
<title>test title</title>
<body>
this is my document body
</body>
</document>
 
<document>
<id>124</id>
<group>46</group>
<timestamp>1132223498</timestamp>
<title>another test</title>
<body>
this is another document
</body>
</document>
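For comparison with the sample above, the following sketch prints the xmlpipe2 layout that the SphinxXmlpipe class produces. The `buildSampleDocset` function and the single book record are made up for illustration; only the element names (sphinx:docset, sphinx:schema, sphinx:field, sphinx:document) and the field names come from the class and the configuration in this article.

```php
<?php
// Hypothetical helper: builds a tiny xmlpipe2 stream with one made-up book.
function buildSampleDocset(): string
{
    $w = new XMLWriter();
    $w->openMemory();
    $w->setIndent(true);
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('sphinx:docset');

    // Schema: one sphinx:field element per full-text field.
    $w->startElement('sphinx:schema');
    foreach (array('book_id', 'book_name') as $field) {
        $w->startElement('sphinx:field');
        $w->writeAttribute('name', $field);
        $w->endElement(); // sphinx:field
    }
    $w->endElement(); // sphinx:schema

    // Document: the id is an attribute, each field is a child element.
    $w->startElement('sphinx:document');
    $w->writeAttribute('id', '123');
    $w->writeElement('book_id', '123');
    $w->writeElement('book_name', 'test title');
    $w->endElement(); // sphinx:document

    $w->endElement(); // sphinx:docset
    $w->endDocument();
    return $w->outputMemory();
}

$xml = buildSampleDocset();
echo $xml;
```

The printed stream has a single sphinx:docset root, the sphinx:schema block first, and then one sphinx:document element per record.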

The corresponding Coreseek settings:

source src1
{
    type = xmlpipe2
    xmlpipe_command = cd /var/www/PHPParser && php index.php /Home/SphinxXmlpipe/xmlpipe2
    xmlpipe_field = book_id
    xmlpipe_field = book_name
    xmlpipe_attr_timestamp = book_id
    xmlpipe_attr_uint = book_id
    xmlpipe_fixup_utf8 = 1
}

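Searching:
1. PHP provides a Sphinx extension, which also works with Coreseek.
2. The Sphinx installation package ships sphinxapi in its api directory.
I use the PHP extension.
Sphinx search code example: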

public function getResultBySearchText($search_text) {
    $sphinxClient = new \SphinxClient();
    $sphinxClient->setServer('localhost', 9312); // searchd host and port
    $sphinxClient->setMatchMode(SPH_MATCH_ANY);  // match documents containing any of the query words
    $sphinxClient->setMaxQueryTime(5000);        // cap the search time at 5 seconds (value is in milliseconds)

    $result = $sphinxClient->query($search_text);

    $rel = array();
    // query() returns false on failure, so there is nothing to read in that case
    if ($result === false) {
        return $rel;
    }

    $rel['time'] = $result['time'];
    // 'matches' is only present when at least one document was found
    if (isset($result['matches'])) {
        $rel['matches'] = $result['matches'];
    }
    return $rel;
}

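Because the xmlpipe data source is used, the search returns document ids, and the actual data still has to be fetched from Mongo by id. I won't cover how to fetch the data from Mongo here; contact me if you need help (sq371426@163.com).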
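As a rough sketch of that last step, and not the author's own code, the helpers below turn a search result into an id list, build a MongoDB `$in` filter on `book_id`, and put the fetched documents back into relevance order. The function names are hypothetical, and the sketch assumes `matches` is keyed by document id (the layout when array-result mode is off). The actual Mongo query is left to whichever client the project uses.

```php
<?php
// Hypothetical helper: ids in the order Sphinx ranked them.
// Assumes 'matches' is keyed by document id.
function matchedIds(array $result): array
{
    return isset($result['matches'])
        ? array_map('intval', array_keys($result['matches']))
        : array();
}

// Filter document for a MongoDB find(); '$in' selects all ids in one query.
function buildIdFilter(array $ids): array
{
    return array('book_id' => array('$in' => array_values($ids)));
}

// An '$in' query does not preserve the order of the id list,
// so restore the relevance order after fetching.
function sortByRank(array $docs, array $ids): array
{
    $byId = array();
    foreach ($docs as $doc) {
        $byId[(int) $doc['book_id']] = $doc;
    }
    $sorted = array();
    foreach ($ids as $id) {
        if (isset($byId[$id])) {
            $sorted[] = $byId[$id];
        }
    }
    return $sorted;
}

// Made-up search result and fetched documents, for illustration only.
$result = array('time' => '0.001', 'matches' => array(7 => array('weight' => 2), 3 => array('weight' => 1)));
$ids = matchedIds($result);
$filter = buildIdFilter($ids);
$docs = array(
    array('book_id' => 3, 'book_name' => 'B'),
    array('book_id' => 7, 'book_name' => 'A'),
);
$sorted = sortByRank($docs, $ids);
```

Fetching all ids with one `$in` query avoids a round trip per hit; the reordering step matters because the database is free to return the documents in any order.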

0 0