lucene
Source: Internet · Editor: 程序博客网 · Date: 2024/06/05 17:08
Overview
Lucene is an open-source full-text search engine library. As a toolkit it provides a complete query engine and indexing engine, plus partial text-analysis engines (for two Western languages, English and German). Its goal is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.
Solr is a standalone enterprise-grade search server that exposes a web-service-like API. Users submit XML documents in a defined format to the server over HTTP to build the index, and issue HTTP GET requests to search, receiving results back as XML.
Solr is built on Lucene and wraps many Lucene details into a complete search engine system (index maintenance and management), a search service (an API layer), and a search framework (configuration-driven parsing). It provides common search-engine features such as highlighting, faceting, and analysis/tokenization out of the box, and also supports master/slave replication and hot index swapping. Lucene, by contrast, is the underlying search engine implementation, focused on index structure, index reading and writing, scoring, and other low-level concerns and optimizations. The two are developed independently: a new Lucene feature is not necessarily picked up by Solr right away, and an application built on Solr may never need to call Lucene directly at all.
This write-up uses Lucene 6.5.1, which requires JDK 1.8. Lucene is covered first, then Solr.
Lucene
Lucene Basics
- Acquire raw content: the first step of any search application is to collect the target content that is to be searched.
- Build documents: next, turn the raw content into documents that the search application can understand and process.
- Analyze documents: before indexing starts, each document is analyzed so that its text becomes candidate index terms. This step is called document analysis.
- Index documents: once documents are built and analyzed, index them so they can later be retrieved.
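Steps 2 and 3 can be illustrated with a plain-Java sketch (no Lucene involved; the class name, stop-word list, and tokenization rule here are simplified assumptions, not the behavior of Lucene's real analyzers):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnalyzeSketch {
    // Very simplified "analysis": lowercase, split on non-letters, drop stop words.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "i", "am"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("I love Lucene and Solr")); // [love, lucene, solr]
    }
}
```

Real Lucene analyzers (covered later) do the same kind of work, but with proper Unicode handling, configurable stop-word sets, and a streaming TokenStream API.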
At Lucene's core are two processes: indexing (how files are turned into index files) and searching (how index data is retrieved quickly). The figure below illustrates both:
In the figure, the left side shows index creation and the right side shows index reading.
The Indexing Process
Introduction
- IndexWriter: the core component that creates and updates the index during indexing.
- Directory: represents the storage location of the index.
- Analyzer: analyzes a document and produces, from its text, the tokens/terms to be indexed. IndexWriter cannot build an index without analysis.
- Document: represents a virtual document made up of fields, where a field holds the physical document's content, metadata, and so on. The Analyzer only works on Documents.
- Field: the smallest unit, and the starting point, of the indexing process. It is a key/value pair in which the key identifies the value to be indexed. A field representing a document's body might use the key "content", with a value holding part or all of the document's text or numeric content. Lucene indexes only text or numeric content.
Code

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>6.5.1</version>
</dependency>

Demo
// 1. Point at the index directory
Directory dir = FSDirectory.open(Paths.get("D:\\lucene"));
// 2. Create the IndexWriter
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
// IndexWriter initialization
if (create) {
    iwc.setOpenMode(OpenMode.CREATE);
} else {
    iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
}
IndexWriter writer = new IndexWriter(dir, iwc);
// 3. Create the Document
Document doc = new Document();
doc.add(new StringField("title", "lucene", Field.Store.YES)); // title, stored
doc.add(new TextField("content", "i am lucene", Field.Store.YES)); // content; usually too large to store, but it gets tokenized
doc.add(new LongPoint("modified", new Date().getTime())); // modified time, not stored
// 4. Index it
if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
    writer.deleteDocuments(new Term("title", "lucene"));
    System.out.println("adding a title: " + "lucene");
    writer.addDocument(doc);
} else {
    System.out.println("updating a title: " + "new lucene");
    doc.clear();
    doc.add(new StringField("title", "new lucene", Field.Store.YES)); // title, stored
    doc.add(new TextField("content", "i am new lucene", Field.Store.YES)); // content, tokenized; stored here for the demo
    doc.add(new LongPoint("modified", new Date().getTime())); // modified time, not stored
    writer.updateDocument(new Term("title", "lucene"), doc); // update = delete the old document, add the new one
}
// 5. Close the writer
writer.close();
Reading the Index
Code
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>6.5.1</version>
</dependency>
Demo
// 1. Point at the index directory
Directory dir = FSDirectory.open(Paths.get("D:\\lucene"));
// 2. Create the IndexSearcher
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
// 3. Create the query
QueryParser parser = new QueryParser("title", new StandardAnalyzer());
Query query = parser.parse("*:*");
// Query query = new TermQuery(new Term("title", "*"));
// Run the query
TopDocs topDocs = searcher.search(query, 100); // top-N results, similar to paging
int count = topDocs.totalHits; // total number of documents matching the query
System.out.println("total hits: " + count);
ScoreDoc[] scoreDocs = topDocs.scoreDocs; // the matching documents
for (ScoreDoc scoreDoc : scoreDocs) {
    int docId = scoreDoc.doc; // the document's ID
    Document doc = searcher.doc(docId); // fetch the document by ID
    System.out.println("title:" + doc.get("title"));
    System.out.println("content:" + doc.get("content"));
    System.out.println("modified:" + doc.get("modified")); // "modified" was not stored, so this returns null
    System.out.println("==========================");
}
// Release resources
reader.close();
How Lucene Works
- Structured data: data with a fixed format or bounded length, e.g. database records and metadata.
- Unstructured data: data with no fixed length or format, e.g. email and Word documents.
- Sequential scanning: scan documents one by one to find all matches (or the first match), e.g. Windows file-content search.
- Full-text search: search via an index. Part of the information in the unstructured data is extracted and reorganized into data with some structure (the index), and searches run against that index, which makes them comparatively fast.
- Create document objects: create one Document object per file and store the file's attributes in it. Each attribute becomes a Field added to the Document, and every Document gets a unique ID.
- Analyze documents: analyze the fields of each Document, e.g. a file-name field and a file-content field. The content string is split into words on whitespace and the words are lowercased. Meaningless words (stop words) are removed from the word list, as is punctuation. The result is a list of keywords; each keyword is called a Term. A Term records both the keyword and the field it came from, so the same word in different fields is a different Term.
- Create the index: build an index over the keyword list, write the index and document objects to the index store, and record the keyword-to-document mapping. Each keyword maps to a postings list whose elements are Document IDs.
- Query the index: IndexSearcher -> Term -> Document.
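The index-creation and index-query steps above amount to building an inverted index: a map from term to the IDs of the documents containing it. A minimal plain-Java sketch (illustrative only; class and method names are made up, and Lucene's real on-disk postings format is far more compact):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> set of document IDs (the "postings list")
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, String content) {
        for (String term : content.toLowerCase().split("[^a-z]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "i am lucene");
        index.add(1, "i am new lucene");
        System.out.println(index.search("lucene")); // [0, 1]
        System.out.println(index.search("new"));    // [1]
    }
}
```

The lookup cost is one map access per term instead of a scan over every document, which is exactly why full-text search beats sequential scanning.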
A handy GUI tool for inspecting Lucene indexes: Luke.
API Usage
Indexing Classes
Directory
// File-based
FSDirectory fsDirectory = FSDirectory.open(Paths.get("D:\\lucene"));
// In-memory
RAMDirectory ramDirectory = new RAMDirectory();
// Two-directory switch; four parameters: Set<String> primaryExtensions (marks which files go to the
// primary directory), Directory primaryDir (the primary directory), Directory secondaryDir (the
// secondary directory), boolean doClose (whether close() really closes the delegates)
Set<String> primaryExtensions = new HashSet<>();
primaryExtensions.add("firestDirMark1");
primaryExtensions.add("firestDirMark2");
FileSwitchDirectory fileSwitchDirectory =
        new FileSwitchDirectory(primaryExtensions, fsDirectory, ramDirectory, true);
// Filter directory: intercepts calls and forwards them to the wrapped Directory.
// FilterDirectory itself is abstract; Lucene ships several functional implementations
// such as SleepingLockWrapper and TrackingDirectoryWrapper.
FilterDirectory filterDirectory = new FilterDirectory(fsDirectory) {
    @Override
    public void deleteFile(String name) throws IOException {
        System.out.println("before deleteFile i want to do something");
        in.deleteFile(name); // "in" here is the wrapped fsDirectory
    }
};

Looking at Directory's implementation classes, you can see that as a storage abstraction it offers database-like features such as directory chaining, locking, and filtering (similar to triggers). But it operates on index structures, which are fundamentally different from the tree-based data structures databases use.
Document
Document doc = new Document();
doc.add(new StringField("key", "value", Field.Store.YES)); // add
String value = doc.get("key"); // get
doc.removeField("key"); // remove
Field
private boolean stored;
private boolean tokenized = true;
private boolean storeTermVectors;
private boolean storeTermVectorOffsets;
private boolean storeTermVectorPositions;
private boolean storeTermVectorPayloads;
private boolean omitNorms;
private IndexOptions indexOptions = IndexOptions.NONE;
private LegacyNumericType numericType;
private boolean frozen;
private int numericPrecisionStep = LegacyNumericUtils.PRECISION_STEP_DEFAULT;
private DocValuesType docValuesType = DocValuesType.NONE;
private int dimensionCount;
private int dimensionNumBytes;

By default a field is not stored (stored=false), tokenized (tokenized=true), and not indexed (indexOptions=NONE). Different Field classes use different FieldType settings. Take StringField as an example:
/** Indexed, not tokenized, omits norms, indexes
 *  DOCS_ONLY, not stored. */
public static final FieldType TYPE_NOT_STORED = new FieldType();

/** Indexed, not tokenized, omits norms, indexes
 *  DOCS_ONLY, stored */
public static final FieldType TYPE_STORED = new FieldType();

static {
    TYPE_NOT_STORED.setOmitNorms(true);
    TYPE_NOT_STORED.setIndexOptions(IndexOptions.DOCS);
    TYPE_NOT_STORED.setTokenized(false);
    TYPE_NOT_STORED.freeze();

    TYPE_STORED.setOmitNorms(true);
    TYPE_STORED.setIndexOptions(IndexOptions.DOCS);
    TYPE_STORED.setStored(true);
    TYPE_STORED.setTokenized(false);
    TYPE_STORED.freeze();
}

Each Field class is distinguished from the others by its FieldType. When comparing the common Field classes, the properties to watch are the index type (IndexOptions), whether the value is tokenized, and whether it is stored. First, the index types (IndexOptions):
- NONE: not indexed
- DOCS: index documents only
- DOCS_AND_FREQS: index documents and term frequencies
- DOCS_AND_FREQS_AND_POSITIONS: index documents, frequencies, and positions
- DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: index documents, frequencies, positions, and offsets
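What the deeper IndexOptions levels record can be seen by computing term frequencies and positions by hand (a simplified sketch with made-up names; Lucene stores this information in compressed postings, not in maps):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PostingsSketch {
    // term -> list of token positions (the DOCS_AND_FREQS_AND_POSITIONS-level information);
    // the term frequency is simply the list's size (the DOCS_AND_FREQS-level information)
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        String[] tokens = text.toLowerCase().split("[^a-z]+");
        for (int pos = 0; pos < tokens.length; pos++) {
            result.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> p = positions("to be or not to be");
        System.out.println(p.get("be"));  // [1, 5] -> frequency 2, positions 1 and 5
        System.out.println(p.get("not")); // [3]
    }
}
```

Frequencies feed scoring, positions enable phrase and proximity queries, and offsets (character start/end, not shown here) support highlighting.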
/* Field */
/*
 * Field itself has many constructors: Reader, TokenStream, BytesRef, String, byte[], plus a FieldType;
 * it also supports setting long, double, etc. values.
 * A BytesRef value has three parts: offset, length, and byte[].
 */
/*
 * StringField
 * 1. There is also the constructor StringField(String name, BytesRef value, Store stored)
 * 2. tokenized=false, indexed as DOCS, storage decided by `stored`; for values such as ID-card
 *    numbers or order numbers
 */
StringField stringField = new StringField("name", "lucene", Field.Store.YES);
/*
 * StoredField
 * 1. Several constructors, all ending up in Field(String name, Object object, FieldType type);
 *    when no type is given, StoredField.TYPE is the default
 * 2. Stored by default, IndexOptions NONE, tokenized=true
 * 3. Used purely for storage; scoring fields such as NumericDocValuesField only score,
 *    and need a StoredField when the value must also be stored
 */
StoredField storedField = new StoredField("name", "object", StoredField.TYPE);
/*
 * TextField
 * 1. Constructors take either a String value, TextField(String name, String value, Store store),
 *    or a stream: TextField(String name, Reader reader), TextField(String name, TokenStream stream)
 * 2. Indexed as DOCS_AND_FREQS_AND_POSITIONS by default, tokenized=true; streams are never stored,
 *    String values are stored depending on `store`
 * 3. Mainly for text content and streams
 */
TextField textField = new TextField("name", "lucene", Field.Store.YES);
/*
 * Scoring fields (NumericDocValuesField as the example; there are also BinaryDocValuesField,
 * CollationDocValuesField, ... differing in the score value type: long for NumericDocValuesField,
 * BytesRef for BinaryDocValuesField)
 * 1. NumericDocValuesField(String name, long value)
 * 2. tokenized=true by default, not stored, IndexOptions NONE; use a separate StoredField to store
 * 3. For these fields also look at DocValuesType in FieldType:
 *    NONE, NUMERIC, BINARY, SORTED, SORTED_NUMERIC, SORTED_SET
 * 4. Used for scoring at query time, together with the query factory methods:
 *    range: Query newRangeQuery(String field, long lowerValue, long upperValue)
 *    exact: Query newExactQuery(String field, long value)
 */
NumericDocValuesField numericDocValuesField = new NumericDocValuesField("numeric", Long.MAX_VALUE);
/*
 * Point fields (BigIntegerPoint as the example; there are also DoublePoint, IntPoint, ...)
 * 1. BigIntegerPoint(String name, BigInteger... point): multiple BigInteger dimensions, but stored
 *    as BytesRef
 * 2. tokenized=true by default, not stored, IndexOptions NONE; use a separate StoredField to store
 * 3. For types such as double/float also note the precision-related FieldType parameters
 *    dimensionCount and dimensionNumBytes
 * 4. Used for BigInteger queries:
 *    - newExactQuery(String, BigInteger) for matching an exact 1D point
 *    - newSetQuery(String, BigInteger...) for matching a set of 1D values
 *    - newRangeQuery(String, BigInteger, BigInteger) for matching a 1D range
 *    - newRangeQuery(String, BigInteger[], BigInteger[]) for matching points/ranges in
 *      n-dimensional space
 */
BigIntegerPoint bigIntegerPoint = new BigIntegerPoint("bigInteger", BigIntegerPoint.MIN_VALUE, BigIntegerPoint.MAX_VALUE);
/*
 * Range fields (FloatRange as the example; there are also IntRange, DoubleRange, ...); unlike
 * points, they cover a continuous range of values
 * 1. FloatRange(String name, final float[] min, final float[] max): multiple dimensions, but
 *    stored as BytesRef
 * 2. tokenized=true by default, not stored, IndexOptions NONE; use a separate StoredField to store
 * 3. Also note the precision-related FieldType parameters dimensionCount and dimensionNumBytes
 * 4. Static factory methods for common float-range searches:
 *    - newIntersectsQuery() matches ranges that intersect the defined search range
 *    - newWithinQuery() matches ranges that are within the defined search range
 *    - newContainsQuery() matches ranges that contain the defined search range
 */
FloatRange floatRange = new FloatRange("float", new float[]{Float.MIN_VALUE}, new float[]{Float.MAX_VALUE});
/* ... to be continued ... */

Only by knowing how the Field classes differ can you pick the right one comfortably in each scenario.
IndexWriter
Directory dir = FSDirectory.open(Paths.get("D:\\lucene"));
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setOpenMode(OpenMode.CREATE); // create mode
iwc.setMaxBufferedDocs(100); // maxBufferedDocs
/* ... */
IndexWriter writer = new IndexWriter(dir, iwc);
// Document operations
writer.addDocument(new Document());
writer.deleteDocuments(new Term(""));
// Directory operations
writer.addIndexes(dir);
// Close
writer.close();
Search Classes
IndexSearcher
When using the IndexSearcher API, the interesting part is its family of search methods.

/* IndexSearcher */
// 1. Point at the index directory
Directory dir1 = FSDirectory.open(Paths.get("D:\\lucene"));
// 2. Create the reader; a different Directory needs a different reader
IndexReader reader = DirectoryReader.open(dir1);
// 3. Create the IndexSearcher
IndexSearcher searcher = new IndexSearcher(reader);
// 4. Search
/* searcher has several search variants; one with a relatively complete parameter list is
 * searchAfter(FieldDoc after, Query query, int numHits, Sort sort, boolean doDocScores, boolean doMaxScore)
 *   after: the last result of the previous page
 *   query: the query
 *   numHits: top-N results, similar to paging
 *   sort: sort order
 *   doDocScores: true = document scores are recomputed
 *   doMaxScore: true = the maximum score is recomputed
 */
searcher.search(new TermQuery(new Term("")), 100, new Sort());
Query
/* Query
 * 1. By analogy with SQL, queries come in two flavors: a query language (parsed strings) and
 *    programmatically built client-side queries.
 * 2. Most Query implementations target a single operation on a single field, e.g. TermQuery
 *    (single-term query), TermRangeQuery (range query), FuzzyQuery (fuzzy query), ...
 * 3. Different queries are combined with BooleanQuery.Builder.
 */
// Using BooleanQuery.Builder
BooleanQuery.Builder builder = new BooleanQuery.Builder();
/* Taken alone, MUST, FILTER, and SHOULD mean roughly the same thing; the differences appear when
 * clauses are combined: MUST (and FILTER) clauses must all match, while two SHOULD clauses need
 * only one match. The combinations also differ subtly in how they score, which is worth noting. */
Occur o = Occur.MUST; // like AND
o = Occur.FILTER;     // like MUST, but does not contribute to scoring
o = Occur.MUST_NOT;   // like NOT
o = Occur.SHOULD;     // two SHOULDs express OR
// Occur only expresses how clauses are joined; TermQuery expresses equality on a field.
// Using FuzzyQuery here instead would express fuzzy matching on the field.
builder.add(new TermQuery(new Term("name", "")), o);
builder.add(new TermQuery(new Term("name", "")), o);
builder.add(new TermQuery(new Term("name", "")), o);
// Prefix builder: BooleanQuery.Builder.add(PrefixQuery, Occur.SHOULD) is roughly equivalent to
PrefixCodedTerms.Builder preBuilder = new PrefixCodedTerms.Builder();
preBuilder.add(new Term("name", ""));
/*
 * Like PrefixCodedTerms.Builder, each Query has a built-in Builder. Most queries accept multiple
 * terms, usually joined as SHOULD (or), sometimes as MUST (verify case by case...). Each Query
 * also has its own twist beyond simple value comparison, e.g. PhraseQuery can set the allowed
 * distance between phrase terms. Almost all of them can be reproduced with BooleanQuery.Builder,
 * so they are not covered one by one.
 */
/* Using the different queries: a brief tour; test against real code for the details */
/* TermQuery: the most common query; matches a field value exactly */
// single term
TermQuery termQuery = new TermQuery(new Term("name", ""));
// multiple terms in one field; the second parameter is the collection of term values --
// TermInSetQuery(String field, Collection<BytesRef> terms)
TermInSetQuery termInSetQuery = new TermInSetQuery("fieldName", new ArrayList<>());
/* WildcardQuery: wildcard query (wildcards are described below) */
WildcardQuery wildcardQuery = new WildcardQuery(new Term("name", "*"));
/* PhraseQuery: phrase query (multiple MUSTs) */
PhraseQuery phraseQuery = new PhraseQuery("filedName", "hello world", "hello china"); // filedName must contain both "hello world" and "hello china"
/* MultiPhraseQuery: multi-phrase query, clauses joined by OR */
Builder multiBuilder = new MultiPhraseQuery.Builder()
        .add(new Term("name", "hello world"))
        .add(new Term("name", "hello china")); // either "hello world" or "hello china" matches
/* Briefly listed, not described in detail -- see the API docs and test with examples:
 * FuzzyQuery: fuzzy query
 * RegexpQuery: regular-expression query
 * TermRangeQuery: term range query
 * PointRangeQuery: point range query
 * ConstantScoreQuery: constant-score query
 * DisjunctionMaxQuery: max-score disjunction query
 * MatchAllDocsQuery: match all documents
 */
/* Creating a Query via QueryParser is more flexible */
/* QueryParser(String f, Analyzer a) -- f: default field, a: analyzer */
QueryParser parser = new QueryParser("name", new StandardAnalyzer());
/* Query parse(String query) -- the string must follow the query syntax */
Query query = parser.parse("title:hello");
/* Query Parser Syntax
 * Official docs: https://lucene.apache.org/core/6_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package.description
 * 1. Format: keywords must be upper case (AND, OR, NOT); without a "field:" prefix, the default
 *    field is searched
 * 2. Wildcard searches
 *      ?  : single-character placeholder
 *      *  : multi-character placeholder (cannot be combined with ?)
 *    Fuzzy searches
 *      ~  : fuzzy query, optionally followed by a number (the number of letters that may differ);
 *           e.g. roam~ matches foam and roams
 *    Range searches
 *      {} : range excluding the bounds
 *      [] : range including the bounds
 *    Regular expression searches
 *      /RegExp/ : regular expression
 *    Proximity searches
 *      "jakarta apache"~10 : jakarta and apache within 10 words of each other
 *    Boosting a term
 *      xx^4 : boost factor; the higher the boost, the more relevant matching documents rank
 *    Boolean operators
 *      AND : and
 *      OR  : or
 *      NOT : same as !; AND, OR, NOT cannot stand alone -- NOT is used as "xxx NOT xxx",
 *            matching the former but not the latter
 *      -   : like NOT, but can stand alone
 *      +   : like AND, but can stand alone
 *    Grouping
 *      () : grouping, to control evaluation order (the usual role of parentheses)
 *    Escaping special characters
 *      + - && || ! ( ) { } [ ] ^ " ~ * ? : \ / must be escaped with '\',
 *      or via the API: QueryParser.escape
 */
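The escaping rule can be mimicked in plain Java. This sketch follows the same idea as QueryParser.escape (prefix each special character with a backslash), but it is an illustrative approximation with a made-up class name, not the real implementation:

```java
public class EscapeSketch {
    // Characters that are special in the classic query parser syntax
    static final String SPECIALS = "+-&|!(){}[]^\"~*?:\\/";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\'); // prefix the special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("title:(a+b)")); // title\:\(a\+b\)
    }
}
```

In real code, prefer calling QueryParser.escape on any user input that should be matched literally rather than interpreted as query syntax.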
TopDocs
/* TopDocs: the result set */
TopDocs topDocs = searcher.search(query, 100); // top-N, similar to paging; there is also
// searchAfter(scoreDoc, query, n), which returns the next n hits after the given scoreDoc
int count = topDocs.totalHits; // total number of documents matching the query
ScoreDoc[] scoreDocs = topDocs.scoreDocs; // the matching documents
for (ScoreDoc scoreDoc : scoreDocs) {
    int docId = scoreDoc.doc; // the document's ID
    System.out.println(docId);
    Document doc2 = searcher.doc(docId); // fetch the document by ID
}
Analyzers
package lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TestAnalyzer {

    public static void main(String[] args) throws Exception {
        // Lucene ships quite a few analyzers; this quick test only hints at the differences between them
        String s = "I love China and WuHan,我爱中国和武汉";

        // Whitespace tokenization
        System.out.println("WhitespaceAnalyzer");
        Analyzer analyzer = new WhitespaceAnalyzer();
        analyze(analyzer, s); // [I] [love] [China] [and] [WuHan,我爱中国和武汉]

        // Simple tokenization
        System.out.println("SimpleAnalyzer");
        analyzer = new SimpleAnalyzer();
        analyze(analyzer, s); // [i] [love] [china] [and] [wuhan] [我爱中国和武汉]

        // Stop-word tokenization (removes stop words such as "and")
        System.out.println("StopAnalyzer");
        analyzer = new StopAnalyzer();
        analyze(analyzer, s); // [i] [love] [china] [wuhan] [我爱中国和武汉]

        // Standard tokenization (one CJK character at a time)
        System.out.println("StandardAnalyzer");
        analyzer = new StandardAnalyzer();
        analyze(analyzer, s); // [i] [love] [china] [wuhan] [我] [爱] [中] [国] [和] [武] [汉]

        // CJK tokenization (two-character shingles)
        System.out.println("CJKAnalyzer");
        analyzer = new CJKAnalyzer();
        analyze(analyzer, s); // [i] [love] [china] [wuhan] [我爱] [爱中] [中国] [国和] [和武] [武汉]

        // Chinese tokenization
        System.out.println("SmartChineseAnalyzer");
        /* requires a pom.xml dependency:
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>6.5.1</version>
        </dependency> */
        analyzer = new SmartChineseAnalyzer(true); // true = use the built-in stopwords.txt, mostly punctuation
        /* analyzer = new SmartChineseAnalyzer(new CharArraySet(list, true));
        List<String> list = new ArrayList<>();
        list.add(","); */
        analyze(analyzer, s); // [i] [love] [china] [and] [wuhan] [我] [爱] [中国] [和] [武汉]

        /* Chinese being as rich as it is, there are many third-party Chinese analyzers, most built
         * specifically for Chinese search and not available in the Maven repository (the jars must
         * be downloaded separately): IKAnalyzer, PaodingAnalyzer, MMAnalyzer, MIK_CAnalyzer, ...
         * Many are no longer maintained and no longer fit Lucene's Analyzer contract (custom
         * analyzers basically all extend Analyzer): for example, Analyzer.tokenStream is final and
         * must not be overridden, but IKAnalyzer overrides that final method, so it fails at runtime.
         *
         * IKAnalyzer: https://code.google.com/archive/p/ik-analyzer/downloads
         */
    }

    private static void analyze(Analyzer analyzer, String s) {
        try {
            TokenStream tokenStream = analyzer.tokenStream(null, s);
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                System.out.print("[" + charTermAttribute + "] ");
            }
            System.out.println();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
IndexReader, IndexWriter, Locking, and Optimization
Locking
- Locks are held per Directory.
- There are two locks: write.lock (the write lock) and commit.lock (the commit lock, taken while segments are merged; it dates from very old Lucene versions, and 6.x in practice only uses write.lock).
- A Directory's lock can be held by only one consumer (such as an IndexWriter) at a time.
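The write.lock behavior can be imitated with plain java.nio file locking (illustrative only: the class and method names are made up, Lucene uses its own LockFactory implementations, and within a single JVM an overlapping lock attempt surfaces as OverlappingFileLockException rather than a failed tryLock):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteLockSketch {
    // Returns true when a second channel cannot obtain the lock the first channel already holds.
    static boolean secondAttemptFails() {
        try {
            Path lockFile = Files.createTempFile("write", ".lock");
            try (FileChannel first = FileChannel.open(lockFile, StandardOpenOption.WRITE);
                 FileChannel second = FileChannel.open(lockFile, StandardOpenOption.WRITE);
                 FileLock lock = first.tryLock()) {     // first "IndexWriter" takes the lock
                if (lock == null) {
                    return false;                       // some other process already holds it
                }
                try {
                    second.tryLock();                   // second "IndexWriter" tries the same lock
                    return false;                       // unexpectedly succeeded
                } catch (OverlappingFileLockException e) {
                    return true;                        // mirrors Lucene's LockObtainFailedException
                }
            } finally {
                Files.deleteIfExists(lockFile);
            }
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (secondAttemptFails()) {
            System.out.println("fail to get write lock");
        }
    }
}
```

The second attempt only succeeds after the first lock is released, which is exactly the behavior the threaded demo below observes with real IndexWriters.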
package lucene.lock;

import java.io.IOException;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class TestIndexWriter {

    public static SimpleDateFormat sf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    public static Analyzer analyzer = new StandardAnalyzer();
    public static Directory dir;

    static {
        try {
            dir = FSDirectory.open(Paths.get("D:\\lucene"));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 3; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    try {
                        addDoc();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
            //TimeUnit.SECONDS.sleep(4);
            /* With the sleep above uncommented (each thread waits until the lock is released):
               2017-08-29 20:23:58 Thread-0 add count:1
               2017-08-29 20:24:02 Thread-1 add count:1
               2017-08-29 20:24:06 Thread-2 add count:1
               With it commented out:
               fail to get write lock
               fail to get write lock
               2017-08-29 20:24:06 Thread-2 add count:1
             */
        }
    }

    public static void addDoc() throws Exception {
        Document doc = new Document();
        doc.add(new StringField("title", Math.random() * 10 + "", Field.Store.YES)); // title, stored
        doc.add(new TextField("content", UUID.randomUUID().toString(), Field.Store.YES)); // content, tokenized
        doc.add(new LongPoint("modified", new Date().getTime())); // modified time, not stored
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = null;
        try {
            // The constructor calls dir.obtainLock(IndexWriter.WRITE_LOCK_NAME) to check the lock
            writer = new IndexWriter(dir, iwc);
            long addDocument = writer.addDocument(doc);
            System.out.println(sf.format(new Date()) + " " + Thread.currentThread().getName()
                    + " add count:" + addDocument);
            TimeUnit.SECONDS.sleep(2);
        } catch (LockObtainFailedException e) {
            // write.lock could not be obtained -> LockObtainFailedException
            System.out.println("fail to get write lock");
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }
}
Optimization
- For concurrent writes, make sure the Directory lock is held by only one IndexWriter: either share a single IndexWriter, or make sure the previous IndexWriter has been closed before creating a new one.
- For IndexReader, keep the index view fresh: when the index changes, the IndexReader needs to be refreshed promptly.
package lucene.lock;

import java.io.IOException;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.ControlledRealTimeReopenThread;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ReferenceManager;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TestOpt {

    public static ReferenceManager<IndexSearcher> searcherManager;
    public static ControlledRealTimeReopenThread<IndexSearcher> controlledRealTimeReopenThread;
    public static IndexWriter writer;

    static {
        Directory dir;
        try {
            dir = FSDirectory.open(Paths.get("D:\\lucene"));
            /*
             * A ReferenceManager can point at a Directory directly, or indirectly through a writer:
             * SearcherManager(IndexWriter writer, boolean applyAllDeletes,
             *                 boolean writeAllDeletes, SearcherFactory searcherFactory)
             */
            searcherManager = new SearcherManager(dir, new SearcherFactory());
            IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
            writer = new IndexWriter(dir, iwc);
            /*
             * double targetMaxStaleSec: maximum time before reopening the reader when no one is waiting
             * double targetMinStaleSec: minimum time before reopening when someone is waiting
             *
             * Internally keeps a searchingGen (generation) marker. Before Lucene 6 this was tied to
             * TrackingIndexWriter, which has since been removed. ControlledRealTimeReopenThread
             * records the currently opened generation; when a caller asks for a generation newer
             * than the one open, it means someone wants the latest Searcher, and
             * waitForGeneration(long targetGen) behaves rather like a join.
             */
            controlledRealTimeReopenThread = new ControlledRealTimeReopenThread<>(writer, searcherManager, 5, 0.025);
            controlledRealTimeReopenThread.setDaemon(true); // run as a background service
            controlledRealTimeReopenThread.setName("test_real_time_reopen_thread");
            controlledRealTimeReopenThread.start();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static SimpleDateFormat sf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public static void main(String[] args) throws Exception {
        // search
        new Thread(new Runnable() {
            @Override
            public void run() {
                IndexSearcher searcher = getIndexSearcher();
                try {
                    search(searcher);
                    searcherManager.release(searcher);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }).start();
        TimeUnit.SECONDS.sleep(1);
        // add a document
        new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    addDoc();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }).start();
        TimeUnit.SECONDS.sleep(1);
        // search again
        new Thread(new Runnable() {
            @Override
            public void run() {
                IndexSearcher searcher = getIndexSearcher();
                try {
                    search(searcher);
                    searcherManager.release(searcher);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }).start();
        // On shutdown, close searcherManager, controlledRealTimeReopenThread, and writer
        TimeUnit.SECONDS.sleep(4);
        searcherManager.close();
        // interrupt first, then close: the interrupt wakes any blocked thread so close() is not stuck
        controlledRealTimeReopenThread.interrupt();
        controlledRealTimeReopenThread.close();
        // make sure everything has been committed
        //writer.commit();
        writer.close();
    }

    public static void addDoc() throws Exception {
        Document doc = new Document();
        doc.add(new StringField("title", Math.random() * 10 + "", Field.Store.YES)); // title, stored
        doc.add(new TextField("content", UUID.randomUUID().toString(), Field.Store.YES)); // content, tokenized
        doc.add(new LongPoint("modified", new Date().getTime())); // modified time, not stored
        writer.addDocument(doc);
        writer.commit(); // do not close: the writer is reused
    }

    public static void search(IndexSearcher searcher) throws Exception {
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        Query query = parser.parse("*:*");
        System.out.println(sf.format(new Date()) + " " + Thread.currentThread().getName()
                + " search count:" + searcher.search(query, 10).totalHits);
    }

    private static IndexSearcher getIndexSearcher() {
        try {
            if (searcherManager.maybeRefresh()) {
                // true: the reference was refreshed (or there was nothing to refresh)
                return searcherManager.acquire();
            } else {
                // false: another thread is refreshing; what we acquire may not be the latest
                System.out.println("warn: another thread is refreshing");
                return searcherManager.acquire();
            }
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
}
Highlighting
package lucene;

import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class TestHighlight {

    /*
     * <dependency>
     *     <groupId>org.apache.lucene</groupId>
     *     <artifactId>lucene-highlighter</artifactId>
     *     <version>6.5.1</version>
     * </dependency>
     *
     * The highlighting API:
     *
     * 1. Fragmenter: splits the original string into independent fragments. Three implementations:
     *    NullFragmenter: returns the whole string as a single fragment, suitable for title fields
     *        and other short fields that should be shown in full in search results.
     *    SimpleFragmenter: the default; breaks text up into same-size fragments with no concern
     *        for sentence boundaries. The fragment length is configurable (100 characters by
     *        default), but the splitting is simplistic: it ignores where the query matched, so a
     *        spanning match can easily be cut across two fragments.
     *    SimpleSpanFragmenter: also breaks text up into same-size fragments, but does not split up
     *        Spans; it tries to make a fragment always contain the span that matched the document.
     *
     * 2. Scorer: the Fragmenter outputs a sequence of fragments, and the Highlighter must pick the
     *    best one(s) to show the user; it asks a Scorer to score each fragment. Two implementations:
     *    QueryTermScorer: scores a fragment by the number of query terms it contains.
     *    QueryScorer: scores only the terms that actually contributed to the document match.
     *
     * 3. Encoder: encodes the raw text into an output format. Two implementations:
     *    DefaultEncoder: used by the Highlighter by default; leaves the text untouched.
     *    SimpleHTMLEncoder: encodes the text as HTML, escaping characters such as <, > and other
     *        non-ASCII characters. After encoding, the last step is formatting the fragment for display.
     *
     * 4. Formatter: renders a fragment as a String, wrapping the highlighted terms for display in
     *    the search results.
     */
    public static void main(String[] args) throws Exception {
        // 1. Choose the index location
        RAMDirectory ramDirectory = new RAMDirectory();
        //Directory ramDirectory = FSDirectory.open(Paths.get("D:\\lucene"));
        // 2. Create the IndexWriter
        Analyzer analyzer = new SmartChineseAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(ramDirectory, iwc);
        index(writer);
        // 3. Create the IndexSearcher
        IndexReader reader = DirectoryReader.open(ramDirectory);
        IndexSearcher searcher = new IndexSearcher(reader);
        hightlight(searcher);
    }

    private static void hightlight(IndexSearcher searcher)
            throws ParseException, InvalidTokenOffsetsException, Exception {
        SmartChineseAnalyzer smartChineseAnalyzer = new SmartChineseAnalyzer();
        QueryParser parser = new QueryParser("content", smartChineseAnalyzer);
        Query query = parser.parse("中国");
        TopDocs docs = searcher.search(query, 10); // search
        System.out.println("searcherDoc()->中国:" + docs.totalHits);
        QueryScorer scorer = new QueryScorer(query);
        Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
        /* custom tags around the highlighted text */
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span class=\"hightlighterCss\">", "</span>");
        // Highlighter highlight = new Highlighter(scorer);
        Highlighter highlight = new Highlighter(formatter, scorer);
        highlight.setTextFragmenter(fragmenter);
        for (ScoreDoc doc : docs.scoreDocs) {
            // read the stored fields of each hit
            int docID = doc.doc;
            Document document = searcher.doc(docID);
            String value = document.get("content");
            System.out.println(value);
            if (value != null) {
                TokenStream tokenStream = smartChineseAnalyzer.tokenStream("content", new StringReader(value));
                String str = highlight.getBestFragment(tokenStream, value);
                System.out.println("result: " + str);
                /* default tags: 我爱<B>中国</B>
                   custom tags:  我爱<span class="hightlighterCss">中国</span> */
            }
        }
    }

    private static void index(IndexWriter writer) throws IOException {
        writer.deleteAll();
        String[] s = new String[]{"我爱中国", "我爱冰冰", "我爱小丸子"};
        for (int i = 0; i < 3; i++) {
            Document doc = new Document(); // a document for the index
            doc.add(new TextField("content", s[i], Field.Store.YES)); // content, tokenized and stored
            writer.addDocument(doc);
        }
        int count = writer.numDocs();
        writer.forceMerge(100); // merge index segments
        writer.close();
        System.out.println("buildDocs()-> documents indexed: " + count);
    }
}
Sorting
Sort sort = new Sort();
sort.setSort(new SortField("title", new FieldComparatorSource() {
    @Override
    public FieldComparator<String> newComparator(String fieldname, int numHits, int sortPos, boolean reversed) {
        return new MyselfComparator(fieldname, numHits);
    }
}));

A custom comparator defines the sort rule:
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.FieldComparator;
import org.apache.lucene.search.LeafFieldComparator;
import org.apache.lucene.search.Scorer;

public class MyselfComparator extends FieldComparator<String> {

    private String[] values;
    private String fieldname;

    public MyselfComparator(String fieldname, int numHits) {
        this.values = new String[numHits];
        this.fieldname = fieldname;
    }

    @Override
    public int compare(int slot1, int slot2) {
        // custom sort rule
        if ("f hello ahj".equals(values[slot1])) {
            return 1;
        } else if ("f hello ahj".equals(values[slot2])) {
            return -1;
        } else {
            return values[slot1].compareTo(values[slot2]);
        }
    }

    @Override
    public void setTopValue(String value) {
    }

    @Override
    public String value(int slot) {
        return values[slot];
    }

    @Override
    public LeafFieldComparator getLeafComparator(final LeafReaderContext context) throws IOException {
        return new LeafFieldComparator() {
            @Override
            public void setScorer(Scorer scorer) throws IOException {
            }

            @Override
            public void setBottom(int slot) {
            }

            @Override
            public void copy(int slot, int doc) throws IOException {
                values[slot] = context.reader().getSortedDocValues(fieldname).get(doc).utf8ToString();
                System.out.println(values[slot]);
            }

            @Override
            public int compareTop(int doc) throws IOException {
                return 0;
            }

            @Override
            public int compareBottom(int doc) throws IOException {
                return 0;
            }
        };
    }
}