Installing and Configuring Sphinx + MySQL + Chinese Word Segmentation

Source: Internet  Published by: cpa软件联盟  Editor: 程序博客网  Date: 2024/06/07 21:57

1. Software to download
mmseg-0.7.3.tar.gz --- Chinese word segmentation library
http://www.coreseek.com/uploads/sources/mmseg-0.7.3.tar.gz

mysql-5.1.49.tar.gz --- MySQL 5.1.49 source code
wget http://dev.mysql.com/get/Downloads/MySQL-5.1/mysql-5.1.49.tar.gz/from/http://mysql.ntu.edu.tw/

sphinx-0.9.8-rc2.tar.gz --- Sphinx 0.9.8-rc2 source code
http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz

fix-crash-in-excerpts.patch --- patch for Sphinx word segmentation support
http://www.coreseek.com/uploads/sources/fix-crash-in-excerpts.patch

sphinx-0.98rc2.zhcn-support.patch --- patch for Sphinx word segmentation support
http://www.coreseek.com/uploads/sources/sphinx-0.98rc2.zhcn-support.patch
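
The downloads above could be scripted roughly as follows. This is only a sketch: fetch_all and its dry-run mode are illustrative names, not part of any of these packages, and the URLs are copied as-is from the list above and may no longer resolve. The MySQL tarball is left out because its mirror-redirect URL is easier to fetch by hand with the wget line shown above.

```shell
#!/bin/sh
# URLs copied verbatim from the download list above.
URLS="
http://www.coreseek.com/uploads/sources/mmseg-0.7.3.tar.gz
http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz
http://www.coreseek.com/uploads/sources/fix-crash-in-excerpts.patch
http://www.coreseek.com/uploads/sources/sphinx-0.98rc2.zhcn-support.patch
"

# fetch_all (hypothetical helper): pass "run" to download for real;
# any other argument only prints the wget commands.
fetch_all() {
    for url in $URLS; do
        if [ "$1" = "run" ]; then
            wget -c "$url" || echo "download failed: $url" >&2
        else
            echo "wget -c $url"
        fi
    done
}

fetch_all dry-run
```

Replacing the last line with fetch_all run performs the actual downloads into the current directory.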

2. Install libmmseg
debian-test-server:/home/software# tar zxvf mmseg-0.7.3.tar.gz
debian-test-server:/home/software# cd mmseg-0.7.3
./configure
make && make install

Compilation fails with the following error:
css/UnigramCorpusReader.cpp:89: error: 'strncmp' was not declared in this scope

Fix:
vi src/css/UnigramCorpusReader.cpp
Add: #include <string.h>
After recompiling, the error is gone.
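
The manual edit above could also be scripted. The sketch below assumes it is acceptable to put the include on the very first line of the file (the exact position is not specified above), and add_string_h is a made-up helper name. A similar missing-header error appears again for freelist.h in step 4, where the same helper would apply.

```shell
#!/bin/sh
# add_string_h (hypothetical helper): prepend "#include <string.h>" to a
# source file unless the file already contains that line.
add_string_h() {
    file="$1"
    if grep -qF '#include <string.h>' "$file"; then
        return 0            # already patched, nothing to do
    fi
    tmp="$file.tmp.$$"
    { echo '#include <string.h>'; cat "$file"; } > "$tmp" && mv "$tmp" "$file"
}

# Usage, run from the mmseg-0.7.3 source directory:
#   add_string_h src/css/UnigramCorpusReader.cpp
```

Because the helper checks for the line first, running it twice leaves the file unchanged the second time.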

A quick test of mmseg:
debian-test-server:/home/software/mmseg-0.7.3# mmseg
Coreseek COS(tm) MM Segment 1.0
Copyright By Coreseek.com All Right Reserved.
Usage: mmseg <option> <file>
-u <unidict>           Unigram Dictionary
-r           Combine with -u, used a plain text build Unigram Dictionary, default Off
-b <Synonyms>           Synonyms Dictionary
-h            print this help and exit

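3. Install MySQL and the Sphinx storage engine for MySQL
Unpack Sphinx, then copy the files from the mysqlse directory under the Sphinx source root into MySQL's storage/sphinx directory. The sphinx directory does not exist yet and has to be created.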

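The two directories contain the same source files, as listed below (Makefile and Makefile.in in the second listing are generated later by the build).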
debian-test-server:/home/software/sphinx-0.9.8-rc2/mysqlse# ls
CMakeLists.txt  Makefile.am   ha_sphinx.cc  plug.in             sphinx.5.0.27.diff
HOWTO.txt       gen_data.php  ha_sphinx.h   sphinx.5.0.22.diff  sphinx.5.0.37.diff

debian-test-server:/home/software/mysql-5.1.49/storage/sphinx# ls
CMakeLists.txt  Makefile     Makefile.in   ha_sphinx.cc  plug.in            sphinx.5.0.27.diff
HOWTO.txt       Makefile.am  gen_data.php  ha_sphinx.h   sphinx.5.0.22.diff sphinx.5.0.37.diff
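
The copy step described above could be sketched as follows. copy_sphinx_se is a hypothetical helper name, and the two arguments are whatever directories the tarballs were unpacked into.

```shell
#!/bin/sh
# copy_sphinx_se (hypothetical helper): copy Sphinx's mysqlse/ files into a
# newly created storage/sphinx directory inside the MySQL source tree.
#   $1 = unpacked Sphinx source dir, $2 = unpacked MySQL source dir
copy_sphinx_se() {
    sphinx_src="$1"
    mysql_src="$2"
    if [ ! -d "$sphinx_src/mysqlse" ]; then
        echo "no mysqlse directory under $sphinx_src" >&2
        return 1
    fi
    mkdir -p "$mysql_src/storage/sphinx" &&
        cp -R "$sphinx_src/mysqlse/." "$mysql_src/storage/sphinx/"
}

# Usage with the paths from this article:
#   copy_sphinx_se /home/software/sphinx-0.9.8-rc2 /home/software/mysql-5.1.49
```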

debian-test-server:/home/software/mysql-5.1.49# sh BUILD/autorun.sh
BUILD/autorun.sh: line 23: aclocal: command not found
Can't execute aclocal

A quick Google search showed that the automake package needs to be installed:
http://ronaldbradford.com/blog/compiling-mysql-5051-under-ubuntu-710-2008-01-14/

aptitude install automake

After that, the following error appeared as well:
BUILD/autorun.sh: line 26: libtoolize: command not found
Can't execute libtoolize

aptitude install libtool

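During the installation I repeatedly ran into errors about one missing package or another, but fortunately resolved them all one by one. Running into difficulties is not necessarily a bad thing: it tests your ability to handle problems, and the next time the same thing comes up it will be easier.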
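
To catch missing build tools up front rather than one error at a time, a check along these lines may help. missing_tools is an illustrative helper; the tool list covers the two commands that failed above plus automake and autoconf, which are assumed here to be needed by autorun.sh as well.

```shell
#!/bin/sh
# missing_tools (hypothetical helper): print each named command that is not
# found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# aclocal ships with automake and libtoolize with libtool; autoconf is an
# assumed additional requirement.
missing=$(missing_tools aclocal automake autoconf libtoolize)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "missing build tools:" $missing
fi
```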

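Before running configure, you must first run the autorun.sh script; otherwise the Sphinx engine will not be built correctly. This step is very important, so be sure not to skip it.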
sh BUILD/autorun.sh

If you hit the following error during compilation, try adding --sysconfdir=/etc to the configure options; after I did that, the error no longer appeared.
my_new.cc
../include/my_global.h:1103: error: redeclaration of C++ built-in type bool
make[1]: *** [my_new.o] Error 1
make[1]: Leaving directory `/home/mysql-5.1.49/mysys'
make: *** [all-recursive] Error 1

./configure --with-plugins=sphinx --prefix=/usr/local/webserver/mysql/ --sysconfdir=/etc --enable-assembler --with-extra-charsets=complex --enable-thread-safe-client --with-big-tables --with-readline --with-ssl --with-embedded-server --enable-local-infile

debian-test-server:/home/software/mysql-5.1.49# ./configure --help | grep sphinx
                          innodb_plugin myisam myisammrg ndbcluster sphinx.
  Plugin Name:      sphinx

Running ./configure -h shows output similar to the following, which indicates that the sphinx plugin has been picked up.
...
...
...
 === Sphinx Storage Engine ===
  Plugin Name:      sphinx
  Description:      Sphinx Storage Engines
  Supports build:   static and dynamic
  Configurations:   max, max-no-ndb

make && make install

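Once MySQL is installed, start the MySQL service, log in with the command-line client, and run show engines. The sphinx storage engine should appear in the list, which is exactly what we want.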
mysql> show engines;
+------------+---------+-----------------------------------------------------------+--------------+------+------------+
| Engine     | Support | Comment                                                   | Transactions | XA   | Savepoints |
+------------+---------+-----------------------------------------------------------+--------------+------+------------+
| CSV        | YES     | CSV storage engine                                        | NO           | NO   | NO         |
| SPHINX     | YES     | Sphinx storage engine 0.9.8                               | NO           | NO   | NO         |
| MEMORY     | YES     | Hash based, stored in memory, useful for temporary tables | NO           | NO   | NO         |
| MyISAM     | DEFAULT | Default engine as of MySQL 3.23 with great performance    | NO           | NO   | NO         |
| MRG_MYISAM | YES     | Collection of identical MyISAM tables                     | NO           | NO   | NO         |
+------------+---------+-----------------------------------------------------------+--------------+------+------------+
5 rows in set (0.07 sec)
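
The same check can be automated. The sketch below only parses SHOW ENGINES output from standard input; has_sphinx_engine is a made-up name, and how you connect to the server (user, password, socket) is left to you.

```shell
#!/bin/sh
# has_sphinx_engine (hypothetical helper): read SHOW ENGINES output on stdin
# and succeed if a SPHINX row reports YES or DEFAULT support. Handles both
# the boxed table format and the tab-separated batch format.
has_sphinx_engine() {
    tr -d '|' | awk '
        $1 == "SPHINX" && ($2 == "YES" || $2 == "DEFAULT") { found = 1 }
        END { exit !found }
    '
}

# Usage against a running server:
#   mysql -e 'SHOW ENGINES' | has_sphinx_engine && echo "SPHINX engine available"
```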


4. Install Sphinx
Apply the patches to Sphinx:
debian-test-server:/home/software/sphinx-0.9.8-rc2# patch -p1 < ../sphinx-0.98rc2.zhcn-support.patch
(Stripping trailing CRs from patch.)
patching file Makefile.in
(Stripping trailing CRs from patch.)
patching file acinclude.m4
...

debian-test-server:/home/software/sphinx-0.9.8-rc2# patch -p1 < ../fix-crash-in-excerpts.patch
(Stripping trailing CRs from patch.)
patching file src/tokenizer_zhcn.cpp

The Sphinx version from the coreseek site provides a --with-mmseg configure option.
debian-test-server:/home/software/sphinx-0.9.8-rc2# ./configure --help | grep mmseg
  --with-mmseg            compile with libmmseg, a mmseg Chinese Segmenter

debian-test-server:/home/software/sphinx-0.9.8-rc2# ./configure --help | grep mysql
  --with-mysql            compile with MySQL support (default is enabled)

debian-test-server:/home/software/sphinx-0.9.8-rc2# ./configure --help | grep prefix
  --prefix=PREFIX         install architecture-independent files in PREFIX

debian-test-server:/home/software/sphinx-0.9.8-rc2# ./configure --prefix=/usr/local/webserver/sphinx --with-mysql=/usr/local/webserver/mysql --with-mmseg-includes=/usr/local/include/mmseg --with-mmseg-lib=/usr/local/lib

Compilation fails with the following error:
/usr/local/include/mmseg/freelist.h:22: error: 'strlen' was not declared in this scope

vi /usr/local/include/mmseg/freelist.h
Add #include <string.h>; after that, the build no longer fails.

make && make install

5. Configure Sphinx
When Sphinx reads its data from MySQL, these are the main settings to change:
 # some straightforward parameters for SQL source types
        sql_host                                = localhost
        sql_user                                = root
        sql_pass                                =
        sql_db                                  = test
        sql_port                                = 3306  # optional, default is 3306

Besides mysql, the data source can also be pgsql or XML (xmlpipe).
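
For orientation, here is a sketch of a minimal sphinx.conf built around the settings above, written out by a small shell helper. The layout follows the sample config shipped with Sphinx, but write_min_conf is a made-up name and the data, log and pid paths are assumptions; check them against your own installation.

```shell
#!/bin/sh
# write_min_conf (hypothetical helper): write a minimal sphinx.conf to $1.
# The index, log and pid paths below are assumptions, not taken from the
# article.
write_min_conf() {
    cat > "$1" <<'EOF'
source src1
{
    type               = mysql
    sql_host           = localhost
    sql_user           = root
    sql_pass           =
    sql_db             = test
    sql_port           = 3306
    sql_query          = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content FROM documents
    sql_attr_uint      = group_id
    sql_attr_timestamp = date_added
}

index test1
{
    source  = src1
    path    = /usr/local/webserver/sphinx/var/data/test1
    docinfo = extern
}

searchd
{
    port     = 3312
    log      = /usr/local/webserver/sphinx/var/log/searchd.log
    pid_file = /usr/local/webserver/sphinx/var/log/searchd.pid
}
EOF
}

# Usage: write_min_conf ./sphinx.conf
```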

Import the example.sql file that ships with Sphinx into MySQL:
debian-test-server:/usr/local/webserver/sphinx/etc# mysql < example.sql

mysql> use test;
Database changed
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| documents      |
+----------------+
1 row in set (0.00 sec)

mysql> select * from documents;
+----+----------+-----------+---------------------+-----------------+---------------------------------------------------------------------------+
| id | group_id | group_id2 | date_added          | title           | content                                                                   |
+----+----------+-----------+---------------------+-----------------+---------------------------------------------------------------------------+
|  1 |        1 |         5 | 2010-08-08 03:12:29 | test one        | this is my test document number one. also checking search within phrases. |
|  2 |        1 |         6 | 2010-08-08 03:12:29 | test two        | this is my test document number two                                       |
|  3 |        2 |         7 | 2010-08-08 03:12:29 | another doc     | this is another group                                                     |
|  4 |        2 |         8 | 2010-08-08 03:12:29 | doc number four | this is to test groups                                                    |
+----+----------+-----------+---------------------+-----------------+---------------------------------------------------------------------------+
4 rows in set (0.00 sec)

6. Build the index
While building the index, indexer could not find the MySQL shared library. Symlinking the MySQL shared libraries into /usr/lib fixed it.
debian-test-server:/usr/local/webserver/mysql/lib/mysql# ln -s /usr/local/webserver/mysql/lib/mysql/libmysqlclient.so /usr/lib/libmysqlclient.so
debian-test-server:/usr/local/webserver/mysql/lib/mysql# ln -s /usr/local/webserver/mysql/lib/mysql/libmysqlclient.so.16 /usr/lib/libmysqlclient.so.16

debian-test-server:/usr/local/webserver/mysql/lib/mysql# /usr/local/webserver/sphinx/bin/indexer --help
Sphinx 0.9.8-rc2 (r1234)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/webserver/sphinx/etc/sphinx.conf'...
WARNING: no such index '--help', skipping.

Build the index:
debian-test-server:/usr/local/webserver/sphinx/etc# ../bin/indexer --config ./sphinx.conf test1
Sphinx 0.9.8-rc2 (r1234)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file './sphinx.conf'...
indexing index 'test1'...
collected 4 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 4 docs, 193 bytes
total 0.030 sec, 6510.37 bytes/sec, 134.93 docs/sec

7. Querying
Start the Sphinx search daemon:
debian-test-server:/usr/local/webserver/sphinx/etc# ../bin/searchd --config ./sphinx.conf

debian-test-server:/usr/local/webserver/sphinx/etc# ps aux | grep sphinx
root      5656  0.0  0.5   6396   652 pts/1    S<   03:44   0:00 ../bin/searchd --config ./sphinx.conf
root      5661  0.0  0.4   1832   572 pts/1    R<+  03:47   0:00 grep sphinx
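
Note that ps aux | grep sphinx also matches the grep process itself, as the second line of the output above shows. A sketch of an alternative check based on the pid file is given below; searchd_running is an illustrative helper, and the pid file location is whatever pid_file is set to in your sphinx.conf.

```shell
#!/bin/sh
# searchd_running (hypothetical helper): succeed if the PID stored in the
# given pid file belongs to a live process. kill -0 sends no signal; it only
# tests whether the process exists and can be signalled by the current user.
searchd_running() {
    pidfile="$1"
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Usage:
#   searchd_running /path/to/searchd.pid && echo "searchd is up"
```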

Test query:
debian-test-server:/usr/local/webserver/sphinx/bin# ./search --config ../etc/sphinx.conf test
Sphinx 0.9.8-rc2 (r1234)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '../etc/sphinx.conf'...
index 'test1': query 'test ': returned 3 matches of 3 total in 0.000 sec

displaying matches:
1. document=1, weight=2, group_id=1, date_added=Sun Aug  8 03:12:29 2010
        id=1
        group_id=1
        group_id2=5
        date_added=2010-08-08 03:12:29
        title=test one
        content=this is my test document number one. also checking search within phrases.
2. document=2, weight=2, group_id=1, date_added=Sun Aug  8 03:12:29 2010
        id=2
        group_id=1
        group_id2=6
        date_added=2010-08-08 03:12:29
        title=test two
        content=this is my test document number two
3. document=4, weight=1, group_id=2, date_added=Sun Aug  8 03:12:29 2010
        id=4
        group_id=2
        group_id2=8
        date_added=2010-08-08 03:12:29
        title=doc number four
        content=this is to test groups

words:
1. 'test': 3 documents, 5 hits

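Sphinx queries fall roughly into three kinds:

a. Queries through the database engine
Query through MySQL's Sphinx storage engine.

b. Queries with the search tool
/usr/local/webserver/sphinx/bin/search --config /usr/local/webserver/sphinx/etc/sphinx.conf test

c. Queries through the PHP API; see sphinxapi.php for details.
Some PHP API files can be found in the original Sphinx source directory: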
debian-test-server:/home/software/sphinx-0.9.8-rc2# find ./ -name '*.php'
./mysqlse/gen_data.php
./api/test.php
./api/sphinxapi.php
./api/test2.php
./test/ubertest.php

There are also Python and Java APIs:
debian-test-server:/home/software/sphinx-0.9.8-rc2/api# ls -l
total 84
drwxr-xr-x 2 500 500  4096 Mar 29  2008 java
-rw-r--r-- 1 500 500 33641 Feb 19  2008 sphinxapi.php
-rw-r--r-- 1 500 500 23191 Mar 14  2008 sphinxapi.py
-rw-r--r-- 1 500 500  5162 Jan 24  2008 test.php
-rw-r--r-- 1 500 500  3331 Feb   2008 test.py
-rw-r--r-- 1 500 500  1053 Nov 16  2007 test2.php
-rw-r--r-- 1 500 500   579 Nov 22  2006 test2.py

8. Using Chinese word segmentation
By chance I noticed that mmseg ships with a few programs written in Ruby.
debian-test-server:/home/software/mmseg-0.7.3# find ./ -name '*.rb'
./ruby/extconf.win.rb
./ruby/test.rb
./ruby/extconf.lin.rb

Generate the dictionary file unigram.txt.uni:
debian-test-server:/home/software/mmseg-0.7.3/data# mmseg -u unigram.txt
debian-test-server:/home/software/mmseg-0.7.3/data# ls
Lexicon_full_words.txt  build_unigram.py  char.stat.txt  unigram.txt  unigram.txt.uni

Copy it into the Sphinx root directory:
debian-test-server:/home/software/mmseg-0.7.3/data# cp unigram.txt.uni /usr/local/webserver/sphinx/

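unigram.txt.uni must be renamed to uni.lib (no other file name works; I tried this myself). Otherwise, once charset_dictpath = /usr/local/webserver/sphinx/ is set in sphinx.conf, building the index fails with the error below. Keep this in mind.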
Unigram dictionary load Error
Segmentation fault
...
Add the following inside the index test1 { } block:
charset_type = zh_cn.utf-8
charset_dictpath = /usr/local/webserver/sphinx/
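
The copy and rename steps above can be combined into one. The sketch below uses a hypothetical helper, install_dict, which places the generated dictionary directly under the name uni.lib in the directory that charset_dictpath points to.

```shell
#!/bin/sh
# install_dict (hypothetical helper): copy the generated dictionary into the
# directory that charset_dictpath points to, under the name uni.lib.
#   $1 = path to unigram.txt.uni, $2 = charset_dictpath directory
install_dict() {
    src="$1"
    dict_dir="$2"
    [ -f "$src" ] || { echo "dictionary not found: $src" >&2; return 1; }
    mkdir -p "$dict_dir" && cp "$src" "$dict_dir/uni.lib"
}

# Usage with the paths from this article:
#   install_dict /home/software/mmseg-0.7.3/data/unigram.txt.uni /usr/local/webserver/sphinx
```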

If searchd is already running, stop it first and then start it again:
debian-test-server:/home/software/mmseg-0.7.3/data# /usr/local/webserver/sphinx/bin/searchd --config /usr/local/webserver/sphinx/etc/sphinx.conf --stop
Sphinx 0.9.8-rc2 (r1234)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/webserver/sphinx/etc/sphinx.conf'...
stop: succesfully sent SIGTERM to pid 5666

debian-test-server:/home/software/mmseg-0.7.3/data# /usr/local/webserver/sphinx/bin/searchd --config /usr/local/webserver/sphinx/etc/sphinx.conf

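Note: after adding data, the index has to be rebuilt so that the new data is picked up.
Rebuild the index and, once that succeeds, start the search daemon again.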

debian-test-server:/usr/local/webserver/sphinx# /usr/local/webserver/sphinx/bin/indexer --config /usr/local/webserver/sphinx/etc/sphinx.conf test1
Sphinx 0.9.8-rc2 (r1234)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/webserver/sphinx/etc/sphinx.conf'...
indexing index 'test1'...
collected 5 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 5 docs, 362 bytes
total 0.041 sec, 8751.78 bytes/sec, 120.88 docs/sec
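
The stop, rebuild, restart sequence could be wrapped in a small script. This is a sketch only: run and reindex are made-up helper names, and it uses just the searchd and indexer options that appear in this article. indexer also has a --rotate option for rebuilding while searchd stays up, which is not used here.

```shell
#!/bin/sh
# Paths from this article; adjust as needed.
SPHINX_HOME=${SPHINX_HOME:-/usr/local/webserver/sphinx}
CONF="$SPHINX_HOME/etc/sphinx.conf"

# run (hypothetical helper): print the command when DRY_RUN=1, else run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reindex (hypothetical helper): stop searchd, rebuild one index, restart.
# A failed stop (for example because searchd was not running) is ignored;
# a failed indexer run aborts before the restart.
reindex() {
    index="$1"
    run "$SPHINX_HOME/bin/searchd" --config "$CONF" --stop
    run "$SPHINX_HOME/bin/indexer" --config "$CONF" "$index" || return 1
    run "$SPHINX_HOME/bin/searchd" --config "$CONF"
}

# Usage: set DRY_RUN=1 and call "reindex test1" to print the three commands
# without running them.
```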

Create the sphinx table:
CREATE TABLE `sphinx` (
  `id` INT(11) NOT NULL,
  `weight` INT(11) NOT NULL,
  `query` VARCHAR(255) NOT NULL,
  `CATALOGID` INT NOT NULL,
  `EDITUSERID` INT NOT NULL,
  `HITS` INT NULL,
  `ADDTIME` INT NOT NULL,
  KEY `query` (`query`)
) ENGINE=SPHINX DEFAULT CHARSET=utf8 CONNECTION='sphinx://localhost:3312/test1';

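Queries for Chinese text return no results. I do not know why yet and will look into it later; my understanding of Sphinx is still fairly shallow.
The relevant settings in sphinx.conf had also been changed: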

sql_query_pre                   = SET NAMES utf8
charset_type            = utf-8

SELECT doc.* FROM documents doc JOIN sphinx ON ( doc.id = sphinx.id ) WHERE `query` = '中文;mode=any'

Queries for English text do return results:
SELECT doc.* FROM documents doc JOIN sphinx ON ( doc.id = sphinx.id ) WHERE `query` = 'test;mode=any'
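+----+----------+-----------+---------------------+-----------------+---------------------------------------------------------------------------+
| id | group_id | group_id2 | date_added          | title           | content                                                                   |
+----+----------+-----------+---------------------+-----------------+---------------------------------------------------------------------------+
|  1 |        1 |         5 | 2010-08-08 03:12:29 | test one        | this is my test document number one. also checking search within phrases. |
|  2 |        1 |         6 | 2010-08-08 03:12:29 | test two        | this is my test document number two                                       |
|  4 |        2 |         8 | 2010-08-08 03:12:29 | doc number four | this is to test groups                                                    |
+----+----------+-----------+---------------------+-----------------+---------------------------------------------------------------------------+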
