Python+Lucene


Installing and Configuring Python + Lucene (PyLucene) + Paoding


PyLucene lets Python call the Lucene API to implement search. The project tracks Lucene releases closely, which is good news for anyone who prefers working in Python.

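PyLucene is built with JCC. JCC reads the signatures of the public classes and methods in a jar file and generates C++ wrapper classes that call the Java classes and methods through JNI (Java Native Interface). The C++ code is compiled into a Python extension module, which embeds a JVM inside the Python virtual machine so that the wrapped classes can be used from Python. For details, see http://lucene.apache.org/pylucene/jcc/documentation/readme.html .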

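Paoding is written against the Lucene API as it was before version 2.9, so I picked the closest matching PyLucene release (pylucene 2.4). The JCC bundled with that release is rather old, however, so the JCC from pylucene 3.3 is used instead.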

The rest of this post assumes that Python 2.7.2 is installed to /data/python-2.7.2 and that all source archives are kept in /data/src.


1 Install Python

Download Python 2.7.2
Change to the unpacked directory
./configure --prefix=/data/python-2.7.2 --enable-shared
make && make install
export LD_LIBRARY_PATH=/data/python-2.7.2/lib
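
Because the build above uses --enable-shared, the interpreter depends on the shared libpython under /data/python-2.7.2/lib, which is why LD_LIBRARY_PATH is exported. As a quick sanity check, the sketch below asks the running interpreter whether it was built that way. It uses only the standard sysconfig module; the helper name built_as_shared is made up for this illustration.

```python
from __future__ import print_function
import sysconfig


def built_as_shared():
    """Return True if this interpreter was configured with --enable-shared."""
    # Py_ENABLE_SHARED is 1 for --enable-shared builds and 0 otherwise;
    # it may be None on platforms that do not define it.
    return bool(sysconfig.get_config_var("Py_ENABLE_SHARED"))


if __name__ == "__main__":
    print("shared build:", built_as_shared())
    # LIBDIR is the directory the build installed its libraries into.
    print("LIBDIR:", sysconfig.get_config_var("LIBDIR"))
```

Run it with /data/python-2.7.2/bin/python; if the first line does not report a shared build, the configure step was probably run without --enable-shared.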

Install the setuptools package

wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz#md5=7df2a529a074f613b509fb44feefe74e
tar zxvf setuptools-0.6c11.tar.gz
cd setuptools-0.6c11
/data/python-2.7.2/bin/python setup.py install


2 Install JCC 2.10

Download pylucene-3.3-3-src.tar.gz
Change to the unpacked directory
cd jcc

Patch setuptools
mkdir tmp
cd tmp
unzip -q /data/python-2.7.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg
patch -Nup0 < /data/src/pylucene-3.3-3/jcc/jcc/patches/patch.43.0.6c11
sudo zip /data/python-2.7.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg -f
cd ..
rm -rf tmp

ln -sf /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64 /usr/lib/jvm/java-6-openjdk
/data/python-2.7.2/bin/python setup.py build
/data/python-2.7.2/bin/python setup.py install
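
Before moving on, it can help to confirm that the interpreter under /data/python-2.7.2 can actually import the freshly installed jcc package, since the PyLucene Makefile in the next step invokes it as "python -m jcc". This is a minimal sketch using only the standard library; the helper name module_available is hypothetical.

```python
from __future__ import print_function
import importlib


def module_available(name):
    """Return True if the module `name` can be imported by this interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        # Covers both a missing package and an extension module that
        # fails to load (for example, a shared library that is not found).
        return False
    return True


if __name__ == "__main__":
    for name in ("setuptools", "jcc"):
        print(name, "OK" if module_available(name) else "MISSING")
```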


3 Install PyLucene + Paoding

Download pylucene-2.4.1-2-src.tar.gz and paoding-analysis-2.0.4-beta.zip
tar zxvf pylucene-2.4.1-2-src.tar.gz
mkdir paoding
cd paoding
unzip ../paoding-analysis-2.0.4-beta.zip

Change to the directory where pylucene-2.4.1-2 was unpacked
vi Makefile and edit it as follows
...
# Linux (Ubuntu 8.10 64-bit, Python 2.5.2, OpenJDK 1.6, setuptools 0.6c9)
PREFIX_PYTHON=/data/python-2.7.2
ANT=ant
PYTHON=$(PREFIX_PYTHON)/bin/python
JCC=$(PYTHON) -m jcc --shared
NUM_FILES=2
...
JARS=$(LUCENE_JAR) $(SNOWBALL_JAR) $(HIGHLIGHTER_JAR) $(ANALYZERS_JAR) \
$(REGEX_JAR) $(QUERIES_JAR) $(INSTANTIATED_JAR) $(EXTENSIONS_JAR) \
/data/src/paoding/paoding-analysis.jar
...
GENERATE=$(JCC) $(foreach jar,$(JARS),--jar $(jar)) \
--include /data/src/paoding/lib/commons-logging.jar \
--package java.lang java.lang.System \
...
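
The GENERATE line relies on make's foreach to turn every entry of JARS into a --jar option, so adding the Paoding jar to JARS is enough to have it wrapped. The sketch below mirrors that expansion in Python to show the shape of the resulting command line. It is only an illustration: build_jcc_args is a hypothetical helper, the jar name lucene-core.jar in the usage part is a placeholder, and the options hidden behind the "..." in the Makefile excerpt are not reproduced.

```python
from __future__ import print_function


def build_jcc_args(jars, includes=(), package=None, classes=()):
    """Assemble a JCC argument list in the same order as the Makefile excerpt."""
    args = ["-m", "jcc", "--shared"]   # JCC=$(PYTHON) -m jcc --shared
    for jar in jars:                   # $(foreach jar,$(JARS),--jar $(jar))
        args += ["--jar", jar]
    for inc in includes:               # extra jars passed with --include
        args += ["--include", inc]
    if package:
        args += ["--package", package]
        args += list(classes)          # class names listed after the package
    return args


if __name__ == "__main__":
    args = build_jcc_args(
        ["lucene-core.jar", "/data/src/paoding/paoding-analysis.jar"],
        includes=["/data/src/paoding/lib/commons-logging.jar"],
        package="java.lang",
        classes=["java.lang.System"],
    )
    print("python", " ".join(args))
```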

Run

make
make install

4 Test

export LD_LIBRARY_PATH=/data/python-2.7.2/lib
export PAODING_DIC_HOME=/data/src/paoding/dic
/data/python-2.7.2/bin/python /data/src/testpylucene.py
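
Both variables have to be set in the shell that launches the test: LD_LIBRARY_PATH so that the shared libpython is found, and PAODING_DIC_HOME so that Paoding can locate its dictionaries. The following sketch checks them up front and reports what is missing. check_env is a hypothetical helper, and the default path is the one assumed throughout this post.

```python
from __future__ import print_function
import os


def check_env(environ, python_lib="/data/python-2.7.2/lib"):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    lib_dirs = environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    if python_lib not in lib_dirs:
        problems.append("LD_LIBRARY_PATH does not contain " + python_lib)
    dic_home = environ.get("PAODING_DIC_HOME")
    if not dic_home:
        problems.append("PAODING_DIC_HOME is not set")
    elif not os.path.isdir(dic_home):
        problems.append("PAODING_DIC_HOME is not a directory: " + dic_home)
    return problems


if __name__ == "__main__":
    for problem in check_env(os.environ):
        print("WARNING:", problem)
```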

The contents of testpylucene.py are as follows:
# -*- coding: utf-8 -*-
#
from lucene import *

texts = ["Python是一个很有吸引力的语言",
"C++语言也很有吸引力,长久不衰",
"我们希望Python和C++高手加入",
"我们的技术巨牛,人人都是高手"]

def search(searcher, qtext):
    tq = TermQuery(Term("content", qtext))
    hits = searcher.search(tq)
    print "----------------------------------------------"
    print "Query:'%s', %d Found" % (qtext,hits.length())
    for i in range(hits.length()):
        doc = hits.doc(i)
        print "\t",doc.get("content")

def dump(reader):
    # Print the terms stored in each document's term vector for "content".
    for i in range(reader.maxDoc()):
        print "-----------------------------------------------"
        tv = reader.getTermFreqVector(i, "content")
        for tk in tv.getTerms():
            print tk

initVM()
directory = RAMDirectory()
analyzer = PaodingAnalyzer()
writer = IndexWriter(directory, analyzer, True)
for text in texts:
    doc = Document()
    doc.add(Field("content", text, Field.Store.YES, Field.Index.TOKENIZED,
        Field.TermVector.YES))
    writer.addDocument(doc)
writer.optimize()
writer.close()
reader = IndexReader.open(directory)
dump(reader)
searcher = IndexSearcher(directory)
search(searcher, "python")
search(searcher, "C++")
search(searcher, "高手")