Solr Indexing (notes on the official Solr documentation)


A Solr index can accept data from different sources, including XML, CSV files, database tables, and files in common formats such as Word and PDF.

There are three common ways to load data into a Solr index:

1. The Solr Cell framework, which extracts content from binary or structured files such as Office Word documents and PDFs.

2. Sending HTTP requests to the Solr server to load XML files.

3. Using Solr's Java API (SolrJ).

A Solr index contains one or more documents, and a document consists of multiple fields; a field may be empty. A field can serve as a unique ID (similar to a primary key in a database), although such a field is not required.

Incoming fields are generally matched against the corresponding field definitions in schema.xml and analyzed according to the steps defined there. If a field is not defined in schema.xml, it is either ignored or mapped to a dynamic field defined in schema.xml.
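For example (a minimal sketch; the *_txt pattern and the text_general field type are assumptions based on the default example schemas), a dynamic field rule in schema.xml looks like this:

<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>

Any incoming field whose name ends in _txt then matches this rule instead of being ignored.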

1. Index Handlers

An index handler is a type of request handler used to add, delete, and update documents. For importing large numbers of documents, Solr Cell with Tika (for binary files) or the Data Import Handler (for structured data) can be used instead.

XML: use Content-Type: application/xml or Content-Type: text/xml, for example:

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary '
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
</add>'
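
The same endpoint also accepts other XML update commands; as a sketch (SOME_ID and the query value are placeholders for your own uniqueKey and query), delete and commit requests look like this:

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary '
<delete>
<id>SOME_ID</id>
<query>subject:Sports</query>
</delete>'

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary '<commit/>'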

JSON: use Content-Type: application/json or Content-Type: text/json, for example:

curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/my_collection/update' --data-binary '
{
  "add": {
    "doc": {
      "id": "DOC1",
      "my_boosted_field": {        /* use a map with boost/value for a boosted field */
        "boost": 2.3,
        "value": "test"
      },
      "my_multivalued_field": [ "aaa", "bbb" ]   /* can use an array for a multi-valued field */
    }
  },
  "add": {
    "commitWithin": 5000,          /* commit this document within 5 seconds */
    "overwrite": false,            /* do not check for existing documents with the same uniqueKey */
    "boost": 3.45,                 /* a document boost */
    "doc": {
      "f1": "v1",                  /* repeated keys can be used for a multi-valued field */
      "f1": "v2"
    }
  },
  "commit": {},
  "optimize": { "waitSearcher": false },
  "delete": { "id": "ID" },        /* delete by ID */
  "delete": { "query": "QUERY" }   /* delete by query */
}'

----------------------------------------------------------------------------------------------------------------------

Note: comments are not allowed in JSON, but duplicate names are; the comments in the example above are for illustration only and must be removed before sending the request.

2. Solr Cell Using Apache Tika

Apache Tika uses existing parser libraries to detect and extract metadata and structured content from documents in different formats (for example HTML, PDF, and Word). The request handler that provides this in Solr is the ExtractingRequestHandler.
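
A sketch of how the ExtractingRequestHandler is commonly invoked with curl (the collection name, literal.id value, form field name, and file path are placeholders); the file is posted to the /update/extract endpoint and literal.id supplies the document's unique key:

curl 'http://localhost:8983/solr/my_collection/update/extract?literal.id=doc1&commit=true' \
-F "myfile=@/path/to/some-document.pdf"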

3. Data Import Handler (the main focus)

The Data Import Handler (DIH) provides a mechanism for importing content from a data store and
indexing it. In addition to relational databases, DIH can index content from HTTP based data sources such as
RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate
fields.

Concepts and Terminology

DataSource: defines where the data comes from, for example a database connection or a URL.

Entity: an entity is processed to generate a set of documents, each containing multiple fields, which Solr then indexes. For a relational database, an entity is a view or a table, which is processed by one or more SQL statements that produce a set of rows (one row per document) with one or more columns (one column per field).

Processor: an entity processor extracts data from the data source, transforms it, and adds it to the index as documents. A custom entity processor can be written to extend or replace the supplied ones.

Transformer: a transformer modifies fields, creates new fields, or produces multiple documents from a single row. Transformers are mainly used for modifying dates and stripping HTML. Custom transformers can also be written against the publicly available interface; a sketch of attaching transformers to an entity follows.
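
A minimal sketch, assuming a hypothetical item table with DESCRIPTION and LAST_MODIFIED columns; HTMLStripTransformer and DateFormatTransformer are standard DIH transformers:

<entity name="item" transformer="HTMLStripTransformer,DateFormatTransformer"
        query="select DESCRIPTION, LAST_MODIFIED from item">
  <!-- strip HTML markup from the description before indexing -->
  <field column="DESCRIPTION" stripHTML="true" />
  <!-- parse the raw date string into a Solr date -->
  <field column="LAST_MODIFIED" dateTimeFormat="yyyy-MM-dd HH:mm:ss" />
</entity>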

Configuration:

In solrconfig.xml:

<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/path/to/my/DIHconfigfile.xml</str>
</lst>
</requestHandler>

--------------------------------------------------------------------------------------------

An example DIH configuration file (example/example-DIH/solr/db/conf/db-data-config.xml):

<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" password="secret"/>
  <document>
    <entity name="item" query="select * from item"
            deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
      <field column="NAME" name="name" />

      <entity name="feature"
              query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
        <field name="features" column="DESCRIPTION" />
      </entity>

      <entity name="item_category"
              query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
        <entity name="category"
                query="select DESCRIPTION from category where ID = '${item_category.CATEGORY_ID}'"
                deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}">
          <field column="description" name="cat" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

a. Data Import Handler Commands

DIH commands are sent to Solr as HTTP requests; a curl sketch follows the list of commands below.

Common commands (passed as the command parameter):

abort: stops the operation currently in progress.

delta-import: runs an incremental import with change detection; it supports the same clean, commit, optimize, and debug parameters as the full-import command (used with the SqlEntityProcessor).

full-import: starts a full import. The call returns immediately and the operation continues in a new thread, with the status attribute set to busy. The start time of the operation is stored in conf/dataimport.properties for use by subsequent delta imports.

reload-config: reloads the configuration file after it has changed, without restarting Solr.

status: returns statistics on the number of documents created and deleted, queries run, rows fetched, status, and so on.

show-config: responds with the current configuration.
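
Assuming the handler is registered at /dataimport as in the solrconfig.xml snippet above (my_collection is a placeholder collection name), the commands can be issued like this:

# start a full import, cleaning the index first and committing at the end
curl 'http://localhost:8983/solr/my_collection/dataimport?command=full-import&clean=true&commit=true'

# run an incremental (delta) import
curl 'http://localhost:8983/solr/my_collection/dataimport?command=delta-import'

# check the progress of a running import
curl 'http://localhost:8983/solr/my_collection/dataimport?command=status'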

---------------------------------------------------------------

Property Writer:

<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter"
directory="data" filename="my_dih.properties" locale="en_US" />

Data Sources:

You can define a custom data source by extending org.apache.solr.handler.dataimport.DataSource.

The available data source types are:

ContentStreamDataSource: takes data POSTed to Solr as the data source; it can be used with any EntityProcessor that uses a DataSource<Reader>.

FieldReaderDataSource: used when a database field itself contains XML to be processed with the XPathEntityProcessor (XPathEntityProcessor + JDBC + FieldReader), for example:

<dataSource name="a1" driver="org.hsqldb.jdbcDriver" ... />
<dataSource name="a2" type=FieldReaderDataSource" />
<document>
<!-- processor for database -->
<entity name ="e1" dataSource="a1" processor="SqlEntityProcessor" pk="docid"
query="select * from t1 ...">
<!-- nested XpathEntity; the field in the parent which is to be used for
Xpath is set in the "datafield" attribute in place of the "url" attribute
-->
<entity name="e2" dataSource="a2" processor="XPathEntityProcessor"
dataField="e1.fieldToUseForXPath">
<!-- Xpath configuration follows --
...
</entity>

FileDataSource: the same as URLDataSource, but used to fetch content from files on disk. Attributes: basePath, encoding.
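
A minimal configuration sketch for a FileDataSource (the name and basePath values are placeholders):

<dataSource name="fileSource" type="FileDataSource" basePath="/data/feeds/" encoding="UTF-8" />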

JdbcDataSource: the default data source, used together with the SqlEntityProcessor.

URLDataSource: typically used with the XPathEntityProcessor, for example:

<dataSource name="a"
type="URLDataSource"
baseUrl="http://host:port/"
encoding="UTF-8"
connectionTimeout="5000"
readTimeout="10000"/>

---------------------------------------------------------------------------------------

Entity Processors

Common attributes:

dataSource

name: required; uniquely identifies an entity.

pk

processor: defaults to SqlEntityProcessor; this attribute is required when the data source is not a relational database.

onError: abort | skip | continue; abort the import, skip the document, or continue indexing despite the error.

preImportDeleteQuery: run before a full import to clean up the index.

postImportDeleteQuery: run after the import completes.

rootEntity:

transformer: optional.

cacheImpl: for example, SortedMapBackedCache.

cacheKey

cacheLookup

where

<entity name="product" query="select description,sku, manu from product" >
<entity name="manufacturer" query="select id, name from manufacturer"
cacheKey="id" cacheLookup="product.manu" cacheImpl="SortedMapBackedCache"/>
</entity>

The SqlEntityProcessor

Attributes (a sketch combining them follows this list):

query: required.

deltaQuery:

parentDeltaQuery

deletedPKQuery

deltaImportQuery
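
A sketch of how these attributes fit together for the item table from the earlier example; the exact delta variable (${dih.delta.ID} vs ${dataimporter.delta.ID}) and its column-name casing depend on the Solr version and the database, so treat the value below as an assumption:

<entity name="item" pk="ID"
        query="select * from item"
        deltaQuery="select ID from item where last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where ID='${dih.delta.ID}'">
  <field column="NAME" name="name" />
</entity>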

The XPathEntityProcessor

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <!-- forEach sets up a processing loop; here there are two expressions -->
    <entity name="slashdot"
            pk="link"
            url="http://rss.slashdot.org/Slashdot/slashdot"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="DateFormatTransformer">
      <field column="source" xpath="/RDF/channel/title" commonField="true" />
      <field column="source-link" xpath="/RDF/channel/link" commonField="true" />
      <field column="subject" xpath="/RDF/channel/subject" commonField="true" />
      <field column="title" xpath="/RDF/item/title" />
      <field column="link" xpath="/RDF/item/link" />
      <field column="description" xpath="/RDF/item/description" />
      <field column="creator" xpath="/RDF/item/creator" />
      <field column="item-subject" xpath="/RDF/item/subject" />
      <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
      <field column="slash-department" xpath="/RDF/item/department" />
      <field column="slash-section" xpath="/RDF/item/section" />
      <field column="slash-comments" xpath="/RDF/item/comments" />
    </entity>
  </document>
</dataConfig>

The MailEntityProcessor

<dataConfig>
<document>
<entity processor="MailEntityProcessor"
user="email@gmail.com"
password="password"
host="imap.gmail.com"
protocol="imaps"
fetchMailsSince="2009-09-20 00:00:00"
batchSize="20"
folders="inbox"
processAttachement="false"
name="sample_entity"/>
</document>
</dataConfig>

The TikaEntityProcessor

<dataConfig>
<dataSource type="BinFileDataSource" />
<document>
<entity name="tika-test" processor="TikaEntityProcessor"
url="../contrib/extraction/src/test-files/extraction/solr-word.pdf"
format="text">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
</document>
</dataConfig>


