Solr Indexing (notes on the official Solr documentation)
Solr's index can ingest data from several kinds of sources, including XML, CSV files, database tables, and common document formats (Word, PDF).
There are three common ways to load data into a Solr index:
1. The Solr Cell framework, which extracts content from binary or structured files such as Word and PDF documents.
2. Sending HTTP requests to the Solr server to upload XML files.
3. Using Solr's Java client API.
A Solr index contains one or more documents, and a document contains one or more fields; a field may be empty. A field can serve as the unique ID (similar to a primary key in a database), but a unique ID field is not required.
Fields are normally matched against the corresponding field definitions in schema.xml and analyzed according to the steps defined there. A field not declared in schema.xml is either ignored or mapped to a dynamic field defined in schema.xml.
1. Index handlers
An index handler is a type of request handler used to add, delete, and update documents. For importing documents in bulk there are also Tika (for rich documents) and the Data Import Handler (for structured data).
XML: set the header Content-Type: application/xml or Content-Type: text/xml.
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml"
--data-binary '
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
</add>'
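The same `<add><doc>` payload can be generated programmatically instead of hand-written. A minimal sketch using Python's standard library (the field names mirror the curl example above; this is an illustration, not part of Solr itself):

```python
# Sketch: building Solr's <add><doc> update XML with xml.etree.ElementTree.
# Field names mirror the curl example above.
import xml.etree.ElementTree as ET

def build_add_doc(fields):
    """Wrap a dict of field name -> value into Solr's <add><doc> XML."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

payload = build_add_doc({
    "authors": "Patrick Eagar",
    "subject": "Sports",
    "isbn": "0002166313",
})
```

The resulting string can then be sent as the request body with the Content-Type headers described above.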
JSON: set the header Content-Type: application/json or Content-Type: text/json.
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update' --data-binary '
{
"add": {
"doc": {
"id": "DOC1",
"my_boosted_field": { /* use a map with boost/value for a boosted field
*/
"boost": 2.3,
"value": "test"
},
"my_multivalued_field": [ "aaa", "bbb" ] /* Can use an array for a
multi-valued field */
}
},
"add": {
"commitWithin": 5000, /* commit this document within 5 seconds */
"overwrite": false, /* don't check for existing documents with the
same uniqueKey */
"boost": 3.45, /* a document boost */
"doc": {
"f1": "v1", /* Can use repeated keys for a multi-valued field
*/
"f1": "v2"
}
},
"commit": {},
"optimize": { "waitSearcher":false },
"delete": { "id":"ID" }, /* delete by ID */
"delete": { "query":"QUERY" } /* delete by query */
}'
----------------------------------------------------------------------------------------------------------------------
note: comments are not allowed in JSON (the ones above are for explanation only), but duplicate names are allowed, and Solr uses them for repeated commands such as add and delete
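The duplicate-name behavior can be seen with Python's json module: a plain load keeps only the last value for a repeated key, while `object_pairs_hook` exposes every pair in order (with nested objects also returned as pair lists). A small illustration:

```python
# Sketch: JSON objects with repeated keys, as in the Solr update body above.
# json.loads normally keeps only the last value for a duplicated key;
# object_pairs_hook=list preserves every (key, value) pair in order
# (nested objects become pair lists too).
import json

body = '{"delete": {"id": "ID1"}, "delete": {"id": "ID2"}}'

collapsed = json.loads(body)                      # only the last "delete" survives
pairs = json.loads(body, object_pairs_hook=list)  # both "delete" commands visible
```

This is why a Solr multi-command body with repeated add/delete keys cannot be built from an ordinary Python dict; it has to be assembled as a string or a pair list.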
2. Solr Cell using Apache Tika
Apache Tika uses existing parser libraries to detect and extract metadata and structured content from documents in different formats (e.g. HTML, PDF, Word). The request handler is ExtractingRequestHandler.
3. Data Import Handler (the main focus)
The Data Import Handler (DIH) provides a mechanism for importing content from a data store and indexing it. In addition to relational databases, DIH can index content from HTTP-based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate fields.
Concepts and terminology:
DataSource: the source of the data, identified by a URL or connection details.
Entity: an entity is processed to produce a stream of documents, each containing multiple fields, which Solr then indexes. For a relational database, an entity corresponds to a view or a table; it is processed by one or more SQL statements, producing one or more rows (one row per document) with one or more columns (one column per field).
Processor: an entity processor extracts data from the data source, transforms it, and adds it to the index as documents. A custom entity processor can be written to extend or replace the supplied ones.
Transformer: a transformer modifies fields, creates new fields, or turns one row into multiple documents. Transformers are commonly used to reformat dates and strip HTML. Custom transformers can be written against the public interface.
Configuration:
In solrconfig.xml:
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/path/to/my/DIHconfigfile.xml</str>
</lst>
</requestHandler>
--------------------------------------------------------------------------------------------
(example/example-DIH/solr/db/conf/db-data-config.xml).
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver"
url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" password="secret"/>
<document>
<entity name="item" query="select * from item"
deltaQuery="select id from item where last_modified >
'${dataimporter.last_index_time}'">
<field column="NAME" name="name" />
<entity name="feature"
query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID from FEATURE where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
<field name="features" column="DESCRIPTION" />
</entity>
<entity name="item_category"
query="select CATEGORY_ID from item_category where
ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where
last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where
ID=${item_category.ITEM_ID}">
<entity name="category"
query="select DESCRIPTION from category where ID =
'${item_category.CATEGORY_ID}'"
deltaQuery="select ID from category where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category
where CATEGORY_ID=${category.ID}">
<field column="description" name="cat" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
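The queries above embed placeholders such as `${item.ID}` and `${dataimporter.last_index_time}`, which DIH resolves from the current parent row and the import state before running each SQL statement. A simplified illustration of that substitution (illustration only; the real DIH resolver handles scoping, escaping, and nested entities):

```python
# Sketch of DIH-style ${...} placeholder substitution, as used in the
# query/deltaQuery attributes above. Not DIH's actual resolver.
import re

def resolve(template, context):
    """Replace ${name} placeholders with values from a flat context dict."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: str(context[m.group(1)]),
                  template)

q = resolve("select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'",
            {"item.ID": 42})
```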
a. Data Import Handler commands
DIH commands are sent to Solr as HTTP requests. Common commands:
abort: stops an operation in progress.
delta-import: incremental import with change detection; supports the same clean, commit, optimize, and debug parameters as full-import (used with SqlEntityProcessor).
full-import: the call returns immediately while the import runs in a new thread, and the status attribute shows busy while it runs. The start time of the operation is stored in conf/dataimport.properties for use by later delta imports.
reload-config: reloads the configuration file without restarting Solr.
status: returns statistics on the number of documents created and deleted, queries run, rows fetched, status, and so on.
show-config: returns the current configuration.
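These commands are invoked over HTTP against the handler registered in solrconfig.xml. A hedged sketch of composing such a request URL with the standard library (host, port, and the core name "my_core" are placeholders for your setup):

```python
# Sketch: composing a DIH command URL. Host/port/core are placeholders.
from urllib.parse import urlencode

def dih_url(base, command, **params):
    """Build a /dataimport request URL for the given DIH command."""
    query = urlencode({"command": command, **params})
    return f"{base}/dataimport?{query}"

url = dih_url("http://localhost:8983/solr/my_core",
              "delta-import", clean="false", commit="true")
```

The resulting URL can be fetched with curl or any HTTP client to trigger the import.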
---------------------------------------------------------------
Property Writer:
<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter"
directory="data" filename="my_dih.properties" locale="en_US" />
Data sources:
You can define a custom data source by extending org.apache.solr.handler.dataimport.DataSource.
Available data source types:
ContentStreamDataSource: uses the POST body of the request as the data source; used with an entity processor that works on a DataSource<Reader>.
FieldReaderDataSource: reads XML stored in a database column; used with XPathEntityProcessor on top of a JDBC query via a field reader, for example:
<dataSource name="a1" driver="org.hsqldb.jdbcDriver" ... />
<dataSource name="a2" type="FieldReaderDataSource" />
<document>
<!-- processor for database -->
<entity name ="e1" dataSource="a1" processor="SqlEntityProcessor" pk="docid"
query="select * from t1 ...">
<!-- nested XpathEntity; the field in the parent which is to be used for
Xpath is set in the "datafield" attribute in place of the "url" attribute
-->
<entity name="e2" dataSource="a2" processor="XPathEntityProcessor"
dataField="e1.fieldToUseForXPath">
<!-- Xpath configuration follows -->
...
</entity>
</entity>
</document>
FileDataSource: like URLDataSource, but fetches data from disk. Attributes: basePath, encoding.
JdbcDataSource: the default data source; used with SqlEntityProcessor.
URLDataSource: used with XPathEntityProcessor.
<dataSource name="a"
type="URLDataSource"
baseUrl="http://host:port/"
encoding="UTF-8"
connectionTimeout="5000"
readTimeout="10000"/>
---------------------------------------------------------------------------------------
Entity Processors
Common attributes:
dataSource: the name of the data source to use.
name: required; uniquely identifies an entity.
pk: the primary key of the entity.
processor: defaults to SqlEntityProcessor; required when the data source is not a relational database.
onError: abort | skip | continue — abort the import, skip the current document, or continue indexing the document anyway.
preImportDeleteQuery: run before a full import to clean up the index.
postImportDeleteQuery: run after a full import completes.
rootEntity: documents are built from the rows of the root entity (by default, the entity directly under <document>).
transformer: optional; the transformer(s) to apply to the entity.
cacheImpl: a cache implementation such as SortedMapBackedCache.
cacheKey: the property of the cached entity to use as the lookup key.
cacheLookup: the value from the parent entity to look up in the cache.
where: shorthand that combines cacheKey and cacheLookup.
<entity name="product" query="select description,sku, manu from product" >
<entity name="manufacturer" query="select id, name from manufacturer"
cacheKey="id" cacheLookup="product.manu" cacheImpl="SortedMapBackedCache"/>
</entity>
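The cacheImpl/cacheKey/cacheLookup attributes above make the child entity's rows load once into an in-memory map, so each parent row is joined by lookup instead of issuing one SQL query per row. A simplified Python illustration of that join (sample data is invented for the sketch):

```python
# Sketch of the cached-lookup join implied by cacheKey/cacheLookup above:
# the child entity's rows are loaded once into a map keyed by cacheKey,
# then each parent row looks up its match (cf. SortedMapBackedCache).
products = [
    {"sku": "P1", "manu": "M1"},
    {"sku": "P2", "manu": "M2"},
]
manufacturers = [
    {"id": "M1", "name": "Acme"},
    {"id": "M2", "name": "Globex"},
]

cache = {m["id"]: m for m in manufacturers}           # cacheKey="id"
docs = [{**p, "manu_name": cache[p["manu"]]["name"]}  # cacheLookup="product.manu"
        for p in products]
```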
The SqlEntityProcessor
Attributes:
query: required; the SQL statement that selects the rows to index.
deltaQuery: selects the primary keys of rows changed since the last index time.
parentDeltaQuery: selects the parent rows affected by changes in this entity.
deletedPkQuery: selects the primary keys of rows deleted since the last index time.
deltaImportQuery: the SQL statement used to fetch each changed row during a delta import.
The XPathEntityProcessor
<dataConfig>
<dataSource type="HttpDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://rss.slashdot.org/Slashdot/slashdot"
processor="XPathEntityProcessor"
<!-- forEach sets up a processing loop ; here there are two
expressions-->
forEach="/RDF/channel | /RDF/item"
transformer="DateFormatTransformer">
<field column="source" xpath="/RDF/channel/title" commonField="true" />
<field column="source-link" xpath="/RDF/channel/link" commonField="true"/>
<field column="subject" xpath="/RDF/channel/subject" commonField="true" />
<field column="title" xpath="/RDF/item/title" />
<field column="link" xpath="/RDF/item/link" />
<field column="description" xpath="/RDF/item/description" />
<field column="creator" xpath="/RDF/item/creator" />
<field column="item-subject" xpath="/RDF/item/subject" />
<field column="date" xpath="/RDF/item/date"
dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
<field column="slash-department" xpath="/RDF/item/department" />
<field column="slash-section" xpath="/RDF/item/section" />
<field column="slash-comments" xpath="/RDF/item/comments" />
</entity>
</document>
</dataConfig>
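A simplified illustration of the forEach/xpath extraction above, run against a tiny invented RDF-like feed. ElementTree's limited XPath stands in for DIH's streaming XPath engine; channel-level values play the role of commonField="true" fields copied onto every item document:

```python
# Sketch of the XPath field extraction above on a minimal RDF-like feed.
import xml.etree.ElementTree as ET

feed = """
<RDF>
  <channel><title>Slashdot</title></channel>
  <item><title>Story one</title><link>http://example.org/1</link></item>
  <item><title>Story two</title><link>http://example.org/2</link></item>
</RDF>
"""

root = ET.fromstring(feed)
source = root.findtext("channel/title")    # commonField="true": shared by all docs
docs = [{"source": source,
         "title": item.findtext("title"),
         "link": item.findtext("link")}
        for item in root.findall("item")]  # forEach over /RDF/item
```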
The MailEntityProcessor
<dataConfig>
<document>
<entity processor="MailEntityProcessor"
user="email@gmail.com"
password="password"
host="imap.gmail.com"
protocol="imaps"
fetchMailsSince="2009-09-20 00:00:00"
batchSize="20"
folders="inbox"
processAttachement="false"
name="sample_entity"/>
</document>
</dataConfig>
The TikaEntityProcessor
<dataConfig>
<dataSource type="BinFileDataSource" />
<document>
<entity name="tika-test" processor="TikaEntityProcessor"
url="../contrib/extraction/src/test-files/extraction/solr-word.pdf"
format="text">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
</document>
</dataConfig>