Sphinx learning tips: the Sphinx search engine used by projects at the hundred-million scale

Source: Internet. Editor: 程序博客网. Time: 2024/05/23 23:46

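Preface

When I was younger I assumed the search feature in most apps and websites was built on a cache + SQL pattern. I never asked whether that entry-level approach could cope once both the data and the users reach the hundreds of millions. The answer is that it definitely cannot. Now that I am a bit older and my projects have grown, I have been lucky enough to work with the relevant solutions, so I am writing them down as a memo.
As usual, this article covers four topics:
1. How to approach and design search when both data and users number in the hundreds of millions.
2. An introduction to Sphinx and its features.
3. Installing and running Sphinx.
4. How Sphinx is used in projects at the hundred-million scale.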

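Main text

1. How to approach and design search when both data and users number in the hundreds of millions.

A large data volume and a large user base mean three problems have to be solved:
1. There will be a very large number of queries, and each one must return quickly.
2. Query concurrency will be high, so the load must be split and distributed correctly.
3. The data will grow quickly, and the newly added data must be processed efficiently so that it becomes searchable in near real time.

MySQL's built-in full-text index is slow to search and offers little customization, so it cannot solve these problems. A faster, more customizable search is needed, and this is where Sphinx comes in: it offers a solution to each of the three problems. Sphinx is a full-text search engine from Russia that provides fast, low-footprint, highly relevant full-text search. The basic workflow is that you give Sphinx a suitable data source, Sphinx builds an index from it, and queries are then served from that index. More importantly, Sphinx has built-in support for MySQL as a data source, so it is simple to use and feels much like using MySQL.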

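2. An introduction to Sphinx and its features

Sphinx as I understand it

1. The Sphinx mechanism has two parts: building the index and searching the index.
2. Sphinx index types: plain indexes, RT (real-time) indexes, and distributed indexes.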
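The two-part mechanism above (build an index first, then answer queries from it) can be illustrated with a toy inverted index. This is only a sketch of the general idea, not how Sphinx is implemented; the function names `build_index` and `search` and the sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Indexing phase: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search phase: return ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data.
docs = {
    1: "full text search engine",
    2: "search with mysql",
    3: "mysql storage engine",
}
index = build_index(docs)
print(sorted(search(index, "search engine")))  # [1]
```

The point of the split is that the expensive work (scanning all the text) happens once at indexing time, while each query only touches the small per-term sets.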

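Features (the latest Sphinx release outperforms some of the figures below)

1. Fast indexing (peak performance of up to 10 MB/s on modern CPUs);
2. Fast searching (average response time under 0.1 seconds per query on 2 to 4 GB of text);
3. Handles massive data (known to handle more than 100 GB of text, and 100 M documents on a single-CPU system);
4. A good relevance algorithm: a composite ranking method based on phrase proximity and statistics (BM25);
5. Distributed search;
6. Can provide search as a MySQL storage engine;
7. Multiple query modes, including boolean, phrase, and word proximity;
8. Multiple full-text fields per document (at most 32);
9. Multiple additional attributes per document (for example group information, timestamps);
10. Single-byte encodings and UTF-8;
11. Native MySQL support (both MyISAM and InnoDB);
12. Native PostgreSQL support.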
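To give a feel for the statistical half of the ranking mentioned in feature 4, here is a sketch of the textbook Okapi BM25 formula. It is an assumption-laden illustration: the parameters `k1` and `b`, the whitespace tokenizer, and the sample documents are my own choices, and Sphinx's internal ranker may differ from this formula.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample data.
docs = [
    "sphinx search sphinx index",
    "mysql search query index",
    "mysql storage engine notes",
]
scores = bm25_scores(docs, "sphinx search")
```

With this data the first document matches both query terms, the second matches one, and the third matches none, so the scores come out in that order.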

In short, it is seriously capable.

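3. Installing and running Sphinx (this part is reposted from another source)

1. Software to install
coreseek's mmseg package
MySQL installation package
sphinx-0.9.8
Sphinx Chinese word segmentation patch 1
Sphinx Chinese word segmentation patch 2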

2. Install libmmseg

tar -zxvf mmseg-0.7.3.tar.gz
cd mmseg-0.7.3
./configure --prefix=/usr/local/mmseg
make
make install

If you run into problems, try the following commands:

echo '/usr/local/mmseg/lib' >> /etc/ld.so.conf
ldconfig -v
ln -s /usr/local/mmseg/bin/mmseg /bin/mmseg

3. Rebuild MySQL
Before installing Sphinx, apply the two patches.

tar -zxvf sphinx-0.9.8-rc2.tar.gz
cd sphinx-0.9.8
patch -p1 < ../sphinx-0.98rc2.zhcn-support.patch
patch -p1 < ../fix-crash-in-excerpts.patch

4. Install Sphinx

cd /root/lemp/sphinx-0.9.8-rc2
./configure --prefix=/usr/local/sphinx --with-mysql=/opt/mysql \
  --with-mysql-includes=/opt/mysql/include/mysql --with-mysql-libs=/opt/mysql/lib/mysql \
  --with-mmseg-includes=/usr/local/mmseg/include --with-mmseg-libs=/usr/local/mmseg/lib --with-mmseg
make

The build may stop with errors like these:

tokenizer_zhcn.cpp:1:30: SegmenterManager.h: No such file or directory
tokenizer_zhcn.cpp:2:23: Segmenter.h: No such file or directory

Point --with-mmseg-includes at the mmseg subdirectory and configure again:

make clean
./configure --prefix=/usr/local/sphinx --with-mysql=/opt/mysql \
  --with-mysql-includes=/usr/local/mysql/include/mysql --with-mysql-libs=/opt/mysql/lib/mysql \
  --with-mmseg-includes=/usr/local/mmseg/include/mmseg --with-mmseg-libs=/usr/local/mmseg/lib --with-mmseg

The build may then fail at link time:

/root/sphinx/sphinx-0.9.8-rc2/src/tokenizer_zhcn.cpp:34: undefined reference to `libiconv_close'
collect2: ld returned 1 exit status

Fix from the official site: "In the meantime I've change the configuration file and set #define USE_LIBICONV 0 in line 8179."
In other words, edit the configure file, change the value at the end of the #define USE_LIBICONV line from 1 to 0, and recompile.

make clean
./configure --prefix=/usr/local/sphinx --with-mysql=/opt/mysql \
  --with-mysql-includes=/usr/local/mysql/include/mysql --with-mysql-libs=/usr/local/mysql/lib/mysql \
  --with-mmseg-includes=/usr/local/mmseg/include/mmseg --with-mmseg-libs=/usr/local/mmseg/lib --with-mmseg

vi configure
Type /define USE_LIBICONV to jump to the target line.
Press i, change 1 to 0, press Esc, then type :wq to save and exit.

make
make install
cd /usr/local/sphinx/etc
cp sphinx.conf.dist sphinx.conf

5. Configure Sphinx

vim /usr/local/sphinx/etc/sphinx.conf

type = mysql
# some straightforward parameters for SQL source types
sql_host = localhost
sql_user = root
sql_pass =
sql_db = test
sql_port = 3306 # optional, default is 3306

address = 127.0.0.1 # safer: listen on the local machine only

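6. Build the index
Once Sphinx is installed, its directory contains three subdirectories: bin, etc, and var.
bin holds the Sphinx executables, including indexer (builds indexes), search (query tool), and searchd (query server). Note: the latest version no longer ships the search tool.

/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf test1

While building the index, indexer may fail to find the shared library libmysqlclient.so.16 because of a database version mismatch. In that case, copy /opt/mysql/lib/mysql/libmysqlclient.so.16.0.0 to /usr/lib, or create a symbolic link to it there.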

7. The query server
To start it:
/usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf

To stop it:
/usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf --stop

Sphinx queries fall roughly into three kinds:

7.1 Queries through the database engine
7.2 Queries through the search tool (no longer provided in the latest version)
    /usr/local/sphinx/bin/search --config /usr/local/sphinx/etc/sphinx.conf test
7.3 Queries through the PHP API; see sphinxapi.php

8. Create the Sphinx startup script and configure it

#!/bin/sh
# sphinx: Startup script for Sphinx search
#
# chkconfig: 345 86 14
# description:  This is a daemon for high performance full text \
#               search of MySQL and PostgreSQL databases. \
#               See http://www.sphinxsearch.com/ for more info.
#
# processname: searchd
# pidfile: $sphinxlocation/var/log/searchd.pid

# Source function library.
. /etc/rc.d/init.d/functions

processname=searchd
servicename=sphinx
username=sphinx
sphinxlocation=/usr/local/sphinx
pidfile=$sphinxlocation/var/log/searchd.pid
searchd=$sphinxlocation/bin/searchd

RETVAL=0

PATH=$PATH:$sphinxlocation/bin

start() {
    echo -n $"Starting Sphinx daemon: "
    daemon --user=$username --check $servicename $processname
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$servicename
}

stop() {
    echo -n $"Stopping Sphinx daemon: "
    $searchd --stop
    #killproc -p $pidfile $servicename -TERM
    RETVAL=$?
    echo
    if [ $RETVAL -eq 0 ]; then
        rm -f /var/lock/subsys/$servicename
        rm -f $pidfile
    fi
}

# See how we were called.
case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status $processname
        RETVAL=$?
        ;;
    restart)
        stop
        sleep 3
        start
        ;;
    condrestart)
        if [ -f /var/lock/subsys/$servicename ]; then
            stop
            sleep 3
            start
        fi
        ;;
    *)
        echo $"Usage: $0 {start|stop|status|restart|condrestart}"
        ;;
esac
exit $RETVAL

chmod 755 /etc/init.d/sphinx
chkconfig --add sphinx
chkconfig --level 345 sphinx on
chkconfig --list|grep sphinx # check the registration
service sphinx start # start
service sphinx stop  # stop; the official script had some problems on my AS4 box, so I patched it crudely
service sphinx restart # restart
service sphinx status # check whether it is running

# confirm that searchd is running as the sphinx user
ps aux |grep searchd
sphinx   24612  0.0  0.3 11376 6256 pts/1    S    14:07   0:00 searchd

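4. How Sphinx is used in projects at the hundred-million scale

Websites and apps tend to share many design ideas and product features, so this section covers the following common scenarios.

Searching descriptions and topics

The main approach is a full index plus an incremental (delta) index, with scheduled jobs rebuilding the indexes at fixed times.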
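The full + incremental idea can be sketched in a few lines. This is a conceptual model only, assuming document ids grow monotonically and existing rows are not edited; the class `MainDeltaIndex` and its methods are hypothetical names, not Sphinx APIs. In a real deployment the two rebuild steps would be indexer runs triggered by scheduled jobs.

```python
class MainDeltaIndex:
    """Toy main+delta scheme: a large main index plus a small delta index."""

    def __init__(self):
        self.main = {}        # doc_id -> text, rebuilt rarely (e.g. nightly)
        self.delta = {}       # doc_id -> text, rebuilt often (e.g. every few minutes)
        self.max_main_id = 0  # highest doc id covered by the main index

    def rebuild_main(self, table):
        """Full reindex: cover every row and reset the delta."""
        self.main = dict(table)
        self.max_main_id = max(table, default=0)
        self.delta = {}

    def rebuild_delta(self, table):
        """Incremental reindex: cover only rows added since the last full run."""
        self.delta = {i: t for i, t in table.items() if i > self.max_main_id}

    def search(self, term):
        """Query both indexes and combine the matching ids."""
        hits = [i for i, t in self.main.items() if term in t.split()]
        hits += [i for i, t in self.delta.items() if term in t.split()]
        return sorted(hits)


# Hypothetical sample data standing in for a database table.
table = {1: "hello topic", 2: "another topic"}
idx = MainDeltaIndex()
idx.rebuild_main(table)

table[3] = "fresh topic"               # new row arrives after the full run
before = idx.search("fresh")           # not searchable yet
idx.rebuild_delta(table)               # cheap: only touches row 3
after = idx.search("fresh")
```

The delta rebuild stays cheap because it only reads rows newer than the checkpoint, which is why it can run far more often than the full rebuild.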

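Searching user nicknames

The main approach is a real-time index plus a distributed index. Because there are so many users, new users are added through the real-time index, while old data is re-read by a script and written back in.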
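A rough model of real-time plus distributed indexing: each shard accepts inserts that are searchable immediately, and a front index fans each query out to all shards and merges the results. The classes `Shard` and `DistributedIndex`, the modulo routing rule, and the substring match are illustrative assumptions, not Sphinx's actual behavior.

```python
class Shard:
    """One real-time index: rows become searchable as soon as they are inserted."""

    def __init__(self):
        self.rows = {}  # user_id -> nickname

    def insert(self, user_id, nickname):
        self.rows[user_id] = nickname

    def search(self, keyword):
        return [(uid, name) for uid, name in self.rows.items() if keyword in name]


class DistributedIndex:
    """Routes each write to one shard and fans each read out to all shards."""

    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]

    def insert(self, user_id, nickname):
        # Assumed routing rule: shard chosen by user id modulo shard count.
        self.shards[user_id % len(self.shards)].insert(user_id, nickname)

    def search(self, keyword, limit=10):
        hits = []
        for shard in self.shards:
            hits.extend(shard.search(keyword))
        return sorted(hits)[:limit]


# Hypothetical sample data.
users = DistributedIndex(shard_count=3)
for uid, name in [(1, "alice"), (2, "alina"), (3, "bob"), (4, "malice")]:
    users.insert(uid, name)
print(users.search("ali"))  # [(1, 'alice'), (2, 'alina'), (4, 'malice')]
```

Backfilling old data, as described above, amounts to a script that reads existing rows and calls the same insert path.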

Search box suggestions

The main approach is a distributed index that suggests terms other users have typed before.
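The suggestion logic itself can be sketched independently of how the index is distributed: record what people search for, then return the most frequent past queries that start with the typed prefix. `Suggester` and its methods are hypothetical names, and the in-memory counter stands in for whatever index actually stores the query log.

```python
from collections import Counter


class Suggester:
    """Suggests previously entered queries that start with a given prefix."""

    def __init__(self):
        self.counts = Counter()  # normalized query -> times entered

    def record(self, query):
        q = query.strip().lower()
        if q:
            self.counts[q] += 1

    def suggest(self, prefix, limit=5):
        p = prefix.strip().lower()
        matches = [(q, n) for q, n in self.counts.items() if q.startswith(p)]
        # Most frequent first; ties broken alphabetically for stable output.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]


# Hypothetical query log.
s = Suggester()
for q in ["sphinx search", "Sphinx Search", "sphinx install", "mysql"]:
    s.record(q)
print(s.suggest("sph"))  # ['sphinx search', 'sphinx install']
```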

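tip: morphology = stem_en enables stemming for English words. English text is then matched by word stem, so different forms of the same word match each other, which improves results when searching English words in Sphinx.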
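To show what stemming buys you, here is a deliberately crude suffix-stripping sketch. It is not the algorithm Sphinx uses for stem_en; the suffix list and the helper names `crude_stem` and `matches` are made up for illustration, and the rules get many real words wrong.

```python
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common English suffix; a rough stand-in for a real stemmer."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def matches(query, text):
    """True if the query word shares a stem with any word in the text."""
    stems = {crude_stem(token) for token in text.split()}
    return crude_stem(query) in stems


print(crude_stem("searching"), crude_stem("searched"), crude_stem("searches"))
# search search search
print(matches("searching", "he searched the index"))  # True
```

Because both the indexed text and the query are reduced to the same stem, a search for one form of a word finds documents containing its other forms.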

0 0