Upgrading from Solr 3.6 to Solr 4.1: schema version differences and more

Source: Internet | Published by: 2016年欧洲杯网络直播 | Editor: 程序博客网 | Time: 2024/05/21 14:03

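At the client's request, we are upgrading the Solr 3.6 installation we currently run to 4.1. I had not looked into 4.1 much, but one of the reasons driving the upgrade is that the cache mechanism in Solr 3.6 can cause a memory leak.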

Below is how the migration went. It took quite a while, perhaps because I never worked with Solr 4.0 in between?

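The first step is setting up a WAR-based Solr 4.1 environment, which is no different from 3.6. Upgrading the parts that belong to Solr itself causes essentially no trouble; the real problems come from the features we customized ourselves.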

1. Tokenization

In 3.6, solr.NGramTokenizerFactory derives from org.apache.solr.analysis.BaseTokenizerFactory. In 4.1 the BaseTokenizerFactory class has been removed, and the factory extends org.apache.lucene.analysis.util.TokenizerFactory directly. Our custom MyNGramFactory therefore only needs to override the init and create methods of TokenizerFactory.
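To make it concrete what kind of tokens an n-gram tokenizer produces, here is a minimal, self-contained sketch in plain Java. It does not use Lucene and is not the code of MyNGramFactory or of Lucene's NGramTokenizer; the class name NGramSketch, the method ngrams, and the gram sizes 1 and 2 are assumptions chosen for illustration, and the order in which Lucene emits grams may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Solr or Lucene.
final class NGramSketch {

    /** Returns every substring of text whose length lies in [minGram, maxGram]. */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        // Works on UTF-16 chars, which is enough for BMP text such as "上海".
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Gram sizes 1..2 are an assumption, not the settings of the original setup.
        System.out.println(ngrams("上海", 1, 2)); // prints [上, 上海, 海]
    }
}
```

A single input string can thus turn into several tokens, which matters for the query-parsing issue described in section 3.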

2. Customizing the similarity

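In 3.6, to improve the precision of fieldNorm, we had modified lucene-core-xxx.jar (see the post on customizing Solr's lucene-core.jar). Solr 4.1 uses Lucene 4.1, and lucene-core-4.1.0.jar bundles many new similarity implementations, so a new package (org.apache.lucene.search.similarities) was added to hold these different similarity classes.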

So we changed the idf method in DefaultSimilarity to return a fixed value (10), and then modified NORM_TABLE in its parent class, TFIDFSimilarity, following the method described in the post linked above.

3. The version attribute in the schema

Right after setting up 4.1, I copied the index built under 3.6 into the data directory and was about to start testing when I ran into a problem:

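Searching for “上海” returned nothing. debugQuery showed that the parsedquery had become “上海上海”: tokenizing “上海” had produced a single combined result with only one term, which obviously cannot match anything. The correct parsedquery should be: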

I tried many things and kept comparing the schema and solrconfig files without finding anything wrong, until I finally spotted it.

The cause was that in my newly created core the version attribute in the schema file was "1.1", whereas in all of my schema files under 3.6 it was "1.5". That led me to the following:

Schema version attribute in the root node

For the up-to-date documentation, see the example schema shipped with Solr

<schema name="example" version="1.5">
  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       version="x.y" is Solr's version number for the schema syntax and
       semantics.  It should not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued
            by nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default
            except for text fields.
       1.3: removed optional field compress feature
       1.4: autoGeneratePhraseQueries attribute introduced to drive QueryParser
            behavior when a single string produces multiple tokens.  Defaults
            to off for version >= 1.4
       1.5: omitNorms defaults to true for primitive field types
            (int, float, boolean, string...)
    -->
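The comment above says that several defaults depend on the schema version. As a rough sketch, two of those rules can be written down in plain Java. This is not Solr's own code: the class SchemaVersionDefaults and its method names are hypothetical, and treating the version as a float is an assumption made for illustration.

```java
// Hypothetical illustration class; it only mirrors the rules stated in the schema comment.
final class SchemaVersionDefaults {

    /** "Defaults to off for version >= 1.4", so older schemas default to on. */
    static boolean autoGeneratePhraseQueriesDefault(float schemaVersion) {
        return schemaVersion < 1.4f;
    }

    /** "1.5: omitNorms defaults to true for primitive field types". */
    static boolean omitNormsDefaultForPrimitives(float schemaVersion) {
        return schemaVersion >= 1.5f;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.1", "1.5"}) {
            float version = Float.parseFloat(v);
            System.out.println("version " + v
                + ": autoGeneratePhraseQueries default = " + autoGeneratePhraseQueriesDefault(version)
                + ", omitNorms default for primitives = " + omitNormsDefaultForPrimitives(version));
        }
    }
}
```

Under these rules a schema declared as version 1.1 and one declared as 1.5 behave differently even when every field definition in them is identical.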

Note that the 1.4 entry mentions the autoGeneratePhraseQueries attribute. Looking further, I found this:

* SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
  autoGeneratePhraseQueries="true" (the default) causes the query parser to
  generate phrase queries if multiple tokens are generated from a single
  non-quoted analysis string. For example WordDelimiterFilter splitting text:pdp-11
  will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11).
  Note that autoGeneratePhraseQueries="true" tends to not work well for non whitespace
  delimited languages. (yonik)
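The behavior described in this changelog entry can be mimicked with a small, self-contained Java sketch. It is not Solr's query parser: the class PhraseQuerySketch and the method parsedQuery are hypothetical, the tokens are passed in by hand rather than produced by an analyzer, and joining the non-phrase form with OR simply follows the example in the entry.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not Solr's query parser.
final class PhraseQuerySketch {

    /** Builds a query string from the tokens that one non-quoted input string produced. */
    static String parsedQuery(String field, List<String> tokens, boolean autoGeneratePhraseQueries) {
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        if (autoGeneratePhraseQueries) {
            // Multiple tokens from one string become a single phrase query.
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        // Otherwise each token becomes its own term clause.
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            sb.append(field).append(':').append(tokens.get(i));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("pdp", "11"); // as if split from "pdp-11"
        System.out.println(parsedQuery("text", tokens, true));  // prints text:"pdp 11"
        System.out.println(parsedQuery("text", tokens, false)); // prints (text:pdp OR text:11)
    }
}
```

For text that an n-gram tokenizer splits into several tokens, the phrase form requires those tokens to match together as one phrase, which is a plausible reason the search came back empty.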

Now it all makes sense. The version attribute in schema.xml really does matter; it is not just decoration! I changed version to 1.5, and done.

0 0