Solr Indexing (notes on the official Solr documentation)


A Solr index can accept data from different sources, including XML, CSV files, database tables, and files in common formats such as Word and PDF.

There are three common ways to load data into a Solr index:

1. The Solr Cell framework, which extracts content from binary or structured files such as Office Word documents and PDFs.

2. Sending HTTP requests to the Solr server to load XML files.

3. Using Solr's Java API (SolrJ).

A Solr index contains one or more documents, and a document consists of multiple fields; a field may be empty. A field can serve as a unique ID (similar to a primary key in a database), although such a field is not required.

Incoming fields are generally matched against the corresponding field definitions in schema.xml and analyzed according to the steps defined there. If a field is not defined in schema.xml, it is either ignored or mapped to a dynamic field defined in schema.xml.
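For example (a minimal sketch; the *_txt pattern and the text_general field type are assumptions based on the default example schemas), a dynamic field rule in schema.xml looks like this:

<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>

Any incoming field whose name ends in _txt then matches this rule instead of being ignored.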

1. Index Handlers

An index handler is a type of request handler used to add, delete, and update documents. For importing large numbers of documents, Solr Cell with Tika (for binary files) or the Data Import Handler (for structured data) can be used instead.

XML: use Content-Type: application/xml or Content-Type: text/xml, for example:

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary '
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
</add>'
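
The same endpoint also accepts other XML update commands; as a sketch (SOME_ID and the query value are placeholders for your own uniqueKey and query), delete and commit requests look like this:

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary '
<delete>
<id>SOME_ID</id>
<query>subject:Sports</query>
</delete>'

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" \
--data-binary '<commit/>'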

JSON: use Content-Type: application/json or Content-Type: text/json, for example:

curl -X POST -H 'Content-Type: application/json' \
'http://localhost:8983/solr/my_collection/update' --data-binary '
{
  "add": {
    "doc": {
      "id": "DOC1",
      "my_boosted_field": {        /* use a map with boost/value for a boosted field */
        "boost": 2.3,
        "value": "test"
      },
      "my_multivalued_field": [ "aaa", "bbb" ]   /* can use an array for a multi-valued field */
    }
  },
  "add": {
    "commitWithin": 5000,          /* commit this document within 5 seconds */
    "overwrite": false,            /* do not check for existing documents with the same uniqueKey */
    "boost": 3.45,                 /* a document boost */
    "doc": {
      "f1": "v1",                  /* repeated keys can be used for a multi-valued field */
      "f1": "v2"
    }
  },
  "commit": {},
  "optimize": { "waitSearcher": false },
  "delete": { "id": "ID" },        /* delete by ID */
  "delete": { "query": "QUERY" }   /* delete by query */
}'

----------------------------------------------------------------------------------------------------------------------

Note: comments are not allowed in JSON, but duplicate names are; the comments in the example above are for illustration only and must be removed before sending the request.

2. Solr Cell Using Apache Tika

Apache Tika uses existing parser libraries to detect and extract metadata and structured content from documents in different formats (for example HTML, PDF, and Word). The request handler that provides this in Solr is the ExtractingRequestHandler.
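
A sketch of how the ExtractingRequestHandler is commonly invoked with curl (the collection name, literal.id value, form field name, and file path are placeholders); the file is posted to the /update/extract endpoint and literal.id supplies the document's unique key:

curl 'http://localhost:8983/solr/my_collection/update/extract?literal.id=doc1&commit=true' \
-F "myfile=@/path/to/some-document.pdf"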

3. Data Import Handler (the main focus)

The Data Import Handler (DIH) provides a mechanism for importing content from a data store and
indexing it. In addition to relational databases, DIH can index content from HTTP based data sources such as
RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate
fields.

Concepts and Terminology

DataSource: defines where the data comes from, for example a database connection or a URL.

Entity: an entity is processed to generate a set of documents, each containing multiple fields, which Solr then indexes. For a relational database, an entity is a view or a table, which is processed by one or more SQL statements that produce a set of rows (one row per document) with one or more columns (one column per field).

Processor: an entity processor extracts data from the data source, transforms it, and adds it to the index as documents. A custom entity processor can be written to extend or replace the supplied ones.

Transformer: a transformer modifies fields, creates new fields, or produces multiple documents from a single row. Transformers are mainly used for modifying dates and stripping HTML. Custom transformers can also be written against the publicly available interface; a sketch of attaching transformers to an entity follows.
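
A minimal sketch, assuming a hypothetical item table with DESCRIPTION and LAST_MODIFIED columns; HTMLStripTransformer and DateFormatTransformer are standard DIH transformers:

<entity name="item" transformer="HTMLStripTransformer,DateFormatTransformer"
        query="select DESCRIPTION, LAST_MODIFIED from item">
  <!-- strip HTML markup from the description before indexing -->
  <field column="DESCRIPTION" stripHTML="true" />
  <!-- parse the raw date string into a Solr date -->
  <field column="LAST_MODIFIED" dateTimeFormat="yyyy-MM-dd HH:mm:ss" />
</entity>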

Configuration:

In solrconfig.xml:

<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/path/to/my/DIHconfigfile.xml</str>
</lst>
</requestHandler>

--------------------------------------------------------------------------------------------

An example DIH configuration file (example/example-DIH/solr/db/conf/db-data-config.xml):

<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" password="secret"/>
  <document>
    <entity name="item" query="select * from item"
            deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
      <field column="NAME" name="name" />

      <entity name="feature"
              query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
        <field name="features" column="DESCRIPTION" />
      </entity>

      <entity name="item_category"
              query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
              deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
              parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
        <entity name="category"
                query="select DESCRIPTION from category where ID = '${item_category.CATEGORY_ID}'"
                deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
                parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}">
          <field column="description" name="cat" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

a. Data Import Handler Commands

DIH commands are sent to Solr as HTTP requests; a curl sketch follows the list of commands below.

Common commands (passed as the command parameter):

abort: stops the operation currently in progress.

delta-import: runs an incremental import with change detection; it supports the same clean, commit, optimize, and debug parameters as the full-import command (used with the SqlEntityProcessor).

full-import: starts a full import. The call returns immediately and the operation continues in a new thread, with the status attribute set to busy. The start time of the operation is stored in conf/dataimport.properties for use by subsequent delta imports.

reload-config: reloads the configuration file after it has changed, without restarting Solr.

status: returns statistics on the number of documents created and deleted, queries run, rows fetched, status, and so on.

show-config: responds with the current configuration.
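
Assuming the handler is registered at /dataimport as in the solrconfig.xml snippet above (my_collection is a placeholder collection name), the commands can be issued like this:

# start a full import, cleaning the index first and committing at the end
curl 'http://localhost:8983/solr/my_collection/dataimport?command=full-import&clean=true&commit=true'

# run an incremental (delta) import
curl 'http://localhost:8983/solr/my_collection/dataimport?command=delta-import'

# check the progress of a running import
curl 'http://localhost:8983/solr/my_collection/dataimport?command=status'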

---------------------------------------------------------------

Property Writer:

<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter"
directory="data" filename="my_dih.properties" locale="en_US" />

Data Sources:

You can define a custom data source by extending org.apache.solr.handler.dataimport.DataSource.

The available data source types are:

ContentStreamDataSource: takes data POSTed to Solr as the data source; it can be used with any EntityProcessor that uses a DataSource<Reader>.

FieldReaderDataSource: used when a database field itself contains XML to be processed with the XPathEntityProcessor (XPathEntityProcessor + JDBC + FieldReader), for example:

<dataSource name="a1" driver="org.hsqldb.jdbcDriver" ... />
<dataSource name="a2" type=FieldReaderDataSource" />
<document>
<!-- processor for database -->
<entity name ="e1" dataSource="a1" processor="SqlEntityProcessor" pk="docid"
query="select * from t1 ...">
<!-- nested XpathEntity; the field in the parent which is to be used for
Xpath is set in the "datafield" attribute in place of the "url" attribute
-->
<entity name="e2" dataSource="a2" processor="XPathEntityProcessor"
dataField="e1.fieldToUseForXPath">
<!-- Xpath configuration follows --
...
</entity>

FileDataSource: the same as URLDataSource, but used to fetch content from files on disk. Attributes: basePath, encoding.
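
A minimal configuration sketch for a FileDataSource (the name and basePath values are placeholders):

<dataSource name="fileSource" type="FileDataSource" basePath="/data/feeds/" encoding="UTF-8" />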

JdbcDataSource: the default data source, used together with the SqlEntityProcessor.

URLDataSource: typically used with the XPathEntityProcessor, for example:

<dataSource name="a"
type="URLDataSource"
baseUrl="http://host:port/"
encoding="UTF-8"
connectionTimeout="5000"
readTimeout="10000"/>

---------------------------------------------------------------------------------------

Entity Processors

Common attributes:

dataSource

name: required; uniquely identifies an entity.

pk

processor: defaults to SqlEntityProcessor; this attribute is required when the data source is not a relational database.

onError: abort | skip | continue; abort the import, skip the document, or continue indexing despite the error.

preImportDeleteQuery: run before a full import to clean up the index.

postImportDeleteQuery: run after the import completes.

rootEntity:

transformer: optional.

cacheImpl: for example, SortedMapBackedCache.

cacheKey

cacheLookup

where

<entity name="product" query="select description,sku, manu from product" >
<entity name="manufacturer" query="select id, name from manufacturer"
cacheKey="id" cacheLookup="product.manu" cacheImpl="SortedMapBackedCache"/>
</entity>

The SqlEntityProcessor

Attributes (a sketch combining them follows this list):

query: required.

deltaQuery:

parentDeltaQuery

deletedPKQuery

deltaImportQuery
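
A sketch of how these attributes fit together for the item table from the earlier example; the exact delta variable (${dih.delta.ID} vs ${dataimporter.delta.ID}) and its column-name casing depend on the Solr version and the database, so treat the value below as an assumption:

<entity name="item" pk="ID"
        query="select * from item"
        deltaQuery="select ID from item where last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where ID='${dih.delta.ID}'">
  <field column="NAME" name="name" />
</entity>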

The XPathEntityProcessor

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <!-- forEach sets up a processing loop; here there are two expressions -->
    <entity name="slashdot"
            pk="link"
            url="http://rss.slashdot.org/Slashdot/slashdot"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="DateFormatTransformer">
      <field column="source" xpath="/RDF/channel/title" commonField="true" />
      <field column="source-link" xpath="/RDF/channel/link" commonField="true" />
      <field column="subject" xpath="/RDF/channel/subject" commonField="true" />
      <field column="title" xpath="/RDF/item/title" />
      <field column="link" xpath="/RDF/item/link" />
      <field column="description" xpath="/RDF/item/description" />
      <field column="creator" xpath="/RDF/item/creator" />
      <field column="item-subject" xpath="/RDF/item/subject" />
      <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
      <field column="slash-department" xpath="/RDF/item/department" />
      <field column="slash-section" xpath="/RDF/item/section" />
      <field column="slash-comments" xpath="/RDF/item/comments" />
    </entity>
  </document>
</dataConfig>

The MailEntityProcessor

<dataConfig>
<document>
<entity processor="MailEntityProcessor"
user="email@gmail.com"
password="password"
host="imap.gmail.com"
protocol="imaps"
fetchMailsSince="2009-09-20 00:00:00"
batchSize="20"
folders="inbox"
processAttachement="false"
name="sample_entity"/>
</document>
</dataConfig>

The TikaEntityProcessor

<dataConfig>
<dataSource type="BinFileDataSource" />
<document>
<entity name="tika-test" processor="TikaEntityProcessor"
url="../contrib/extraction/src/test-files/extraction/solr-word.pdf"
format="text">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
</document>
</dataConfig>


