Solr Advanced, Part 4: Creating an Index from Files

Source: Internet  Editor: 程序博客网  Time: 2024/05/18 03:27

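Index data does not always come from structured sources such as databases, XML, JSON, or CSV. It often comes from unstructured files such as PDF, Word, HTML, or MP3. Solr supports building indexes from this kind of data well, using Apache Tika to do the extraction.

Let's look at how to create an index from a PDF file in Solr 4.10.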

First, configure file indexing

Create a new core to hold the file-based index. For the detailed steps, see:

http://blog.csdn.net/u011439289/article/details/41699009

Import the JARs

In the working directory, create a folder named extract to hold the JARs for the Solr extension.

\solr_tomcat\solr\pdf_core\extract

Copy solr-cell-4.10.2.jar from \solr-4.10.2\dist into the extract folder, then copy the JAR files under \solr-4.10.2\contrib\extraction\lib into the extract folder as well.

Configure solrconfig.xml

Add the request handler configuration:


<requestHandler name="/extract" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
        <str name="fmap.content">text</str>
        <str name="lowernames">true</str>
        <str name="uprefix">attr_</str>
        <str name="captureAttr">true</str>
    </lst>
</requestHandler>

Specify the location of the dependency JARs:

<lib dir="extract" regex=".*\.jar" />

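Note that this relative path is resolved against the core's home directory, not against the folder that contains the configuration file. For example, my configuration file is in \solr_tomcat\solr\pdf_core\conf and my JARs are in \solr_tomcat\solr\pdf_core\extract, so the relative path is extract, not ../extract.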
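
To make the resolution rule concrete, here is a minimal Java sketch. It assumes the core home is /solr_tomcat/solr/pdf_core (written with forward slashes for portability). LibDirDemo and resolveLibDir are hypothetical names used only for illustration and are not part of Solr.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class LibDirDemo {
    // Resolves a <lib dir="..."> value against the core home directory
    // (the folder that contains conf/), mirroring the rule described above.
    static Path resolveLibDir(Path coreHome, String libDir) {
        return coreHome.resolve(libDir).normalize();
    }

    public static void main(String[] args) {
        Path coreHome = Paths.get("/solr_tomcat/solr/pdf_core"); // assumed location
        // "extract" stays inside the core directory.
        System.out.println(resolveLibDir(coreHome, "extract"));
        // "../extract" would point one level above the core, which is wrong here.
        System.out.println(resolveLibDir(coreHome, "../extract"));
    }
}
```

With these inputs, extract resolves to a folder inside pdf_core, while ../extract lands next to pdf_core, which is why the shorter form is the right one for this layout.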

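Configure schema.xml: define the types of the index fields, that is, the field types.

The text_general type uses two txt files (stopwords.txt and synonyms.txt). Both ship with the example core in the distribution, at \solr_tomcat\solr\collection1\conf. Copy these two files into the conf directory of the new core, alongside schema.xml.

Note: if you created the new core by copying an existing one, some fields are already defined in the original configuration file. Make sure you remove one copy of every duplicated definition!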

<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Configure the index fields, that is, the field elements

 

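One of them is a dynamic field, attr_*. What does it mean? When Solr parses a file, the file itself carries many attributes, and which ones are present is not known in advance. Solr extracts all of them and builds each field name by prefixing the file's own attribute name with attr_.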

<field name="id"        type="string"       indexed="true" stored="true" multiValued="false" required="true"/>
<field name="text"      type="text_general" indexed="true" stored="true"/>
<field name="_version_" type="long"         indexed="true" stored="true"/>

<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
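
As a rough sketch of how attribute names can end up as attr_* fields, consider the following Java example. It is a simplified illustration, not Solr's actual implementation: it assumes that lowernames=true lower-cases a name and replaces non-alphanumeric characters with underscores, and that uprefix is applied only to names that are not defined in the schema. FieldNameDemo, lowerName, and mapFieldName are hypothetical helpers.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class FieldNameDemo {
    // Assumed behaviour of lowernames=true: lower-case letters and digits,
    // every other character becomes an underscore.
    static String lowerName(String name) {
        StringBuilder sb = new StringBuilder();
        for (char ch : name.toCharArray()) {
            sb.append(Character.isLetterOrDigit(ch) ? Character.toLowerCase(ch) : '_');
        }
        return sb.toString();
    }

    // Names that are not explicit schema fields get the uprefix, so they
    // match the attr_* dynamic field.
    static String mapFieldName(String attributeName, Set<String> schemaFields, String uprefix) {
        String name = lowerName(attributeName);
        return schemaFields.contains(name) ? name : uprefix + name;
    }

    public static void main(String[] args) {
        Set<String> schemaFields = new HashSet<>(Arrays.asList("id", "text", "_version_"));
        System.out.println(mapFieldName("Content-Type", schemaFields, "attr_"));
        System.out.println(mapFieldName("id", schemaFields, "attr_"));
    }
}
```

Under these assumptions a metadata attribute such as Content-Type is stored in a field named attr_content_type, while id keeps its name because the schema defines it explicitly.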

At this point the Solr server-side configuration is complete.

 

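Test class CreateIndexFromPDF.java

The required JARs are listed in the earlier article "Solr Advanced, Part 1: Adding Indexes from Java Code and Adding IKAnalyzer Tokenization Support".

In SolrJ 4.10, the addFile method of ContentStreamUpdateRequest takes an additional contentType parameter that specifies the content type of the file. For the possible values, see: ContentType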

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

import java.io.File;
import java.io.IOException;

/**
 * Created by Lhx on 14-12-4.
 */
public class CreateIndexFromPDF {

    public static void indexFilesSolr(String fileName, String solrId) throws IOException, SolrServerException {
        // The URL must include the core name, not just the Solr root.
        String urlString = "http://localhost:8080/solr/pdf_core";
        SolrServer solr = new HttpSolrServer(urlString);

        // Send the file to the /extract handler configured in solrconfig.xml.
        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/extract");
        String contentType = "application/pdf";
        up.addFile(new File(fileName), contentType);

        // literal.id supplies the unique key of the document.
        up.setParam("literal.id", solrId);
        // Fields that are not in the schema get the attr_ prefix.
        up.setParam("uprefix", "attr_");
        // Overrides the handler default (text): the extracted body goes to attr_content.
        up.setParam("fmap.content", "attr_content");
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        solr.request(up);

        // Query everything back to inspect the indexed result.
        QueryResponse rsp = solr.query(new SolrQuery("*:*"));
        System.out.println(rsp);
    }

    public static void main(String[] args) {
        String fileName = "F:\\Sencha_Touch_2.0用户指南(中文版).pdf";
        String solrId = "Sencha_Touch_2.0用户指南(中文版).pdf";
        try {
            indexFilesSolr(fileName, solrId);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SolrServerException e) {
            e.printStackTrace();
        }
    }
}
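
The test class hard-codes application/pdf as the content type. When indexing other file types, one option is to derive the value from the file extension. The following is a minimal sketch under that assumption. ContentTypeDemo and guessContentType are hypothetical names, and the table covers only a handful of common MIME types.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ContentTypeDemo {
    private static final String FALLBACK = "application/octet-stream";
    private static final Map<String, String> TYPES = new HashMap<>();

    static {
        TYPES.put("pdf", "application/pdf");
        TYPES.put("doc", "application/msword");
        TYPES.put("html", "text/html");
        TYPES.put("htm", "text/html");
        TYPES.put("mp3", "audio/mpeg");
        TYPES.put("txt", "text/plain");
    }

    // Looks up a MIME type by file extension (case-insensitive) and falls
    // back to a generic binary type when the extension is missing or unknown.
    static String guessContentType(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(ext, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(guessContentType("F:\\manual.pdf"));
        System.out.println(guessContentType("song.MP3"));
    }
}
```

The returned string could then be passed as the second argument of addFile instead of the fixed application/pdf value.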

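Running the code above uploads our PDF file to the Solr server, where it is parsed and indexed.

The solr.query call at the end runs a query to show the result of parsing and indexing. After parsing, the PDF becomes plain text, and the console output also shows a lot of additional information about the document.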
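
For orientation, the values passed to setParam correspond to request parameters on the extract handler. The sketch below only builds such a URL as a string so the parameters are visible in one place. It assumes the handler is registered as /extract as in this article, and it does not claim to reproduce how SolrJ transmits the request. ExtractUrlDemo and extractUrl are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractUrlDemo {
    // Builds the handler URL with the same parameters the test class sets.
    static String extractUrl(String coreUrl, String solrId) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", solrId);           // unique key of the document
        params.put("uprefix", "attr_");             // prefix for fields not in the schema
        params.put("fmap.content", "attr_content"); // target field for the extracted body
        params.put("commit", "true");               // make the document searchable right away

        StringBuilder sb = new StringBuilder(coreUrl).append("/extract");
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator).append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(extractUrl("http://localhost:8080/solr/pdf_core", "doc1"));
    }
}
```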

After Solr has parsed the PDF and created the index, we can also inspect the result in the Solr admin UI.

Select "Query" and click the "Execute Query" button.

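Postscript:

After restarting Tomcat, it reported a duplicate field definition error. I had hit the same error in earlier exercises, so I quickly found the duplicated definitions in schema.xml (id, long, and similar entries) and deleted them.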
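
Duplicates like these can also be caught before restarting Tomcat. The following is a minimal sketch that parses a schema with the JDK's built-in DOM parser and reports field and fieldType names that occur more than once. SchemaDuplicateDemo and findDuplicates are hypothetical names, and treating fieldType and fieldtype as the same tag is an assumption based on the mixed spelling used in this article's schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SchemaDuplicateDemo {
    // Returns entries such as "fieldtype:long" for every name defined twice.
    static Set<String> findDuplicates(String schemaXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)));
            Set<String> seen = new HashSet<>();
            Set<String> duplicates = new TreeSet<>();
            NodeList all = doc.getElementsByTagName("*"); // every element in the document
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                String tag = e.getTagName().toLowerCase(Locale.ROOT);
                if (!tag.equals("field") && !tag.equals("fieldtype")) {
                    continue;
                }
                String key = tag + ":" + e.getAttribute("name");
                if (!seen.add(key)) {
                    duplicates.add(key);
                }
            }
            return duplicates;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse schema XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<schema><fieldType name='long'/><fieldtype name='long'/>"
                + "<field name='id' type='string'/></schema>";
        System.out.println(findDuplicates(xml));
    }
}
```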

Starting Tomcat again then produced an error saying that a certain JAR could not be loaded. It took me a while to notice that in

<lib dir="extract" regex=".*\.jar" />

the directory given in dir was wrong, which is what made Tomcat report the error.

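After that, Tomcat started without errors, but running the code from the Java console reported an error.

It turned out that I had written the urlString address incorrectly, as: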

http://localhost:8080/solr

That URL does not say which core the file should be uploaded to. After correcting it, the PDF document could be submitted.
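
One way to avoid this mistake is to build the URL from the Solr root and the core name instead of typing it by hand. A minimal sketch follows. CoreUrlDemo and coreUrl are hypothetical names, and the values are the ones used in this article.

```java
class CoreUrlDemo {
    // Joins the Solr root URL and a core name, tolerating a trailing slash.
    static String coreUrl(String solrRoot, String coreName) {
        String base = solrRoot.endsWith("/")
                ? solrRoot.substring(0, solrRoot.length() - 1)
                : solrRoot;
        return base + "/" + coreName;
    }

    public static void main(String[] args) {
        System.out.println(coreUrl("http://localhost:8080/solr", "pdf_core"));
    }
}
```

The result is the value to pass to the HttpSolrServer constructor, so the core name can never be left out.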

 

Appendix:

solrconfig.xml

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!--
 This is a stripped down config file used for a simple example...
 It is *not* a good example to work from.
-->
<config>
    <luceneMatchVersion>4.10.2</luceneMatchVersion>

    <!--  The DirectoryFactory to use for indexes.
          solr.StandardDirectoryFactory, the default, is filesystem based.
          solr.RAMDirectoryFactory is memory based, not persistent, and doesn't work with replication. -->
    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

    <dataDir>${solr.core0.data.dir:}</dataDir>

    <!-- To enable dynamic schema REST APIs, use the following for <schemaFactory>:

         <schemaFactory class="ManagedIndexSchemaFactory">
           <bool name="mutable">true</bool>
           <str name="managedSchemaResourceName">managed-schema</str>
         </schemaFactory>

         When ManagedIndexSchemaFactory is specified, Solr will load the schema from
         the resource named in 'managedSchemaResourceName', rather than from schema.xml.
         Note that the managed schema resource CANNOT be named schema.xml.  If the managed
         schema does not exist, Solr will create it after reading schema.xml, then rename
         'schema.xml' to 'schema.xml.bak'.

         Do NOT hand edit the managed schema - external modifications will be ignored and
         overwritten as a result of schema modification REST API calls.

         When ManagedIndexSchemaFactory is specified with mutable = true, schema
         modification REST API calls will be allowed; otherwise, error responses will be
         sent back for these requests.
    -->
    <schemaFactory class="ClassicIndexSchemaFactory"/>

    <updateHandler class="solr.DirectUpdateHandler2">
        <updateLog>
            <str name="dir">${solr.core0.data.dir:}</str>
        </updateLog>
    </updateHandler>

    <!-- realtime get handler, guaranteed to return the latest stored fields
       of any document, without the need to commit or open a new searcher. The current
       implementation relies on the updateLog feature being enabled. -->
    <requestHandler name="/get" class="solr.RealTimeGetHandler">
        <lst name="defaults">
            <str name="omitHeader">true</str>
        </lst>
    </requestHandler>

    <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy"/>

    <requestDispatcher handleSelect="true">
        <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048"/>
    </requestDispatcher>

    <requestHandler name="standard" class="solr.StandardRequestHandler" default="true"/>
    <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler"/>
    <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
    <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers"/>

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
        <lst name="invariants">
            <str name="q">solrpingquery</str>
        </lst>
        <lst name="defaults">
            <str name="echoParams">all</str>
        </lst>
    </requestHandler>

    <!-- Newly added content -->
    <requestHandler name="/extract" class="solr.extraction.ExtractingRequestHandler">
        <lst name="defaults">
            <str name="fmap.content">text</str>
            <str name="lowernames">true</str>
            <str name="uprefix">attr_</str>
            <str name="captureAttr">true</str>
        </lst>
    </requestHandler>
    <lib dir="extract" regex=".*\.jar"/>

    <!-- config for the admin interface -->
    <admin>
        <defaultQuery>solr</defaultQuery>
    </admin>
</config>

schema.xml

<?xml version="1.0" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<schema name="example core zero" version="1.1">

    <!-- general -->
    <field name="type" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="name" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="core0" type="string" indexed="true" stored="true" multiValued="false"/>

    <!-- field to use to determine and enforce document uniqueness. -->
    <uniqueKey>id</uniqueKey>

    <!-- field for the QueryParser to use when an explicit fieldname is absent -->
    <defaultSearchField>name</defaultSearchField>

    <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
    <solrQueryParser defaultOperator="OR"/>

    <!-- Newly added. Types such as long and string were already defined in the original config file; remember to delete the duplicates. -->
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

    <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="text" type="text_general" indexed="true" stored="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>

    <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
</schema>

Reference article:

Creating an index from files in Solr 4.7

