Getting Started with Solr: Notes on the Official 6.0 Documentation (Part 10)

The Well-Configured Solr Instance
This chapter explains how to tune a Solr instance for best performance.
Configuring solrconfig.xml

The configuration in solrconfig.xml has a major impact on how Solr behaves.

Among other things, it configures:
request handlers, which process the requests to Solr, such as requests to add documents to the index or requests to return results for a query
listeners, processes that "listen" for particular query-related events; listeners can be used to trigger the execution of special code, such as invoking some common queries to warm-up caches
the Request Dispatcher for managing HTTP communications
the Admin Web interface
parameters related to replication and duplication (these parameters are covered in detail in Legacy Scaling and Distribution)

Topics covered:
DataDir and DirectoryFactory in SolrConfig
Lib Directives in SolrConfig
Schema Factory Definition in SolrConfig
IndexConfig in SolrConfig
RequestHandlers and SearchComponents in SolrConfig
InitParams in SolrConfig
UpdateHandlers in SolrConfig
Query Settings in SolrConfig
RequestDispatcher in SolrConfig
Update Request Processors
Codec Factory


Substituting Properties in Solr Config Files

solrconfig.xml supports substituting property values at load time:
${propertyname[:option default value]}
If the property is set at runtime its value is used; otherwise the optional default applies; if neither is available, an error is raised.

There are several ways to specify these variables:
JVM System Properties
Any JVM System properties, usually specified using the -D flag when starting the JVM, can be used as variables in any XML configuration file in Solr.

For example, in the sample solrconfig.xml files, you will see this value which defines the locking type to use:
<lockType>${solr.lock.type:native}</lockType>
This means the lock type defaults to "native", but when starting Solr you could override it with a JVM
system property by launching Solr with:
bin/solr start -Dsolr.lock.type=none
In general, any Java system property that you want to set can be passed through the bin/solr script using the
standard -Dproperty=value syntax. Alternatively, you can add common system properties to the SOLR_OPTS
environment variable defined in the Solr include file (bin/solr.in.sh). For more information about how the
Solr include file works, refer to: Taking Solr to Production.


Two ways to set system properties:
pass them on the command line when starting Solr, or
set them in the Solr include file (bin/solr.in.sh).

solrcore.properties

If the configuration directory for a Solr core contains a file named solrcore.properties that file can contain
any arbitrary user defined property names and values using the Java standard properties file format, and those
properties can be used as variables in the XML configuration files for that Solr core.
For example, the following solrcore.properties file could be created in the conf/ directory of a collection
using one of the example configurations, to override the lockType used.
#conf/solrcore.properties
solr.lock.type=none


This is the second way to define per-core properties: solrcore.properties.

By default the file is named solrcore.properties and lives under the core's conf/ directory; its name and location can be overridden via the properties key in core.properties.

User defined properties from core.properties

For example, consider the following core.properties file:
#core.properties
name=collection2
my.custom.prop=edismax
The my.custom.prop property can then be used as a variable, such as in solrconfig.xml:
<requestHandler name="/select">
<lst name="defaults">
<str name="defType">${my.custom.prop}</str>
</lst>
</requestHandler>


Implicit Core Properties

Implicitly defined core properties:
All implicit properties use the solr.core. prefix, and reflect the runtime value of the equivalent core.properties property:
solr.core.name
solr.core.config
solr.core.schema
solr.core.dataDir
solr.core.transient
solr.core.loadOnStartup
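
As a minimal sketch (the handler and parameter name below are illustrative, not from the original notes), an implicit property can be referenced like any other substitution variable in solrconfig.xml:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- resolves to the running core's name at load time -->
    <str name="coreName">${solr.core.name}</str>
  </lst>
</requestHandler>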



DataDir and DirectoryFactory in SolrConfig

Specifying a Location for Index Data with the dataDir Parameter
Use dataDir to specify where the index data is stored:

<dataDir>/var/data/solr/</dataDir>
If you are using replication to replicate the Solr index (as described in Legacy Scaling and Distribution), then the <dataDir> directory should correspond to the index directory used in the replication configuration.

The path may be absolute or relative to the core's instance directory, and should be kept consistent with any replication setup.

Specifying the DirectoryFactory For Your Index

You can force a particular implementation by specifying solr.MMapDirectoryFactory, solr.NIOFSDirectoryFactory, or solr.SimpleFSDirectoryFactory.
<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
The solr.RAMDirectoryFactory is memory based, not persistent, and does not work with replication. Use
this DirectoryFactory to store your index in RAM.
<directoryFactory class="org.apache.solr.core.RAMDirectoryFactory"/>


The standard factory picks a filesystem implementation appropriate for the operating system. The index can also be stored on HDFS: use solr.HdfsDirectoryFactory instead of any of the implementations above, as sketched below.
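
A minimal sketch of an HDFS-backed index, assuming a reachable HDFS namenode (the host, port, and path are placeholders; solr.hdfs.home is the property documented for this factory):

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <!-- placeholder location; point this at your own namenode and path -->
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
</directoryFactory>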

Lib Directives in SolrConfig

A regular expression can restrict which jar files are loaded; all directories are resolved relative to the Solr instanceDir:


<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />
<lib dir="../../../contrib/langid/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-langid-\d.*\.jar" />
<lib dir="../../../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar" />

Schema Factory Definition in SolrConfig

While the "read" features of the Solr API are supported for all Schema types, support for making Schema modifications programatically depends on the <schemaFactory/> in use.

Managed Schema Default

When a <schemaFactory/> is not explicitly declared in solrconfig.xml, Solr implicitly uses a ManagedIndexSchemaFactory.

An example:
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

mutable - controls whether changes may be made to the Schema data. This must be set to true to allow edits to be made with the Schema API.
managedSchemaResourceName is an optional parameter that defaults to "managed-schema", and defines a new name for the schema file that can be anything other than "schema.xml".

Classic schema.xml

disallows any programatic changes to the Schema at run time. 

<schemaFactory class="ClassicIndexSchemaFactory"/>

Runtime changes are not supported; edits to schema.xml take effect only after the core is reloaded.

Switching from schema.xml to Managed Schema

An uneditable schema.xml can be converted to the managed (editable) schema by changing the schemaFactory configuration in solrconfig.xml.

Changing to Manually Edited schema.xml

Switching back to a manually edited schema.xml.

Steps:

Rename the managed-schema file to schema.xml.
Modify solrconfig.xml to replace the schemaFactory class:
remove any ManagedIndexSchemaFactory definition if it exists, and
add a ClassicIndexSchemaFactory definition as shown above.
Reload the core(s).
If you are using SolrCloud, you may need to modify the files via ZooKeeper.


IndexConfig in SolrConfig

In most cases, the defaults are fine
<indexConfig>
...
</indexConfig>

Parameters covered in this section:
Writing New Segments
Merging Index Segments
Compound File Segments
Index Locks
Other Indexing Settings


Writing New Segments

ramBufferSizeMB

<ramBufferSizeMB>100</ramBufferSizeMB>



maxBufferedDocs

<maxBufferedDocs>1000</maxBufferedDocs>


useCompoundFile

<useCompoundFile>false</useCompoundFile>

The settings above control when buffered updates are flushed to new segments.

Merging Index Segments

mergePolicyFactory
default in Solr is to use a TieredMergePolicy
Other policies available are the LogByteSizeMergePolicy and LogDocMergePolicy. 

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="segmentsPerTier">10</int>
</mergePolicyFactory>

Controlling Segment Sizes: Merge Factors

For TieredMergePolicy, this is controlled by setting the <int name="maxMergeAtOnce"> and <int name="segmentsPerTier"> options, while LogByteSizeMergePolicy has a single <int name="mergeFactor"> option (all of which default to "10").

Merging index segments speeds up searching, at the cost of additional time spent while indexing. See the sketch below for the LogByteSizeMergePolicy variant.
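
For illustration only, a sketch of configuring LogByteSizeMergePolicy with its single mergeFactor option (the value shown is simply the default mentioned above):

<mergePolicyFactory class="org.apache.solr.index.LogByteSizeMergePolicyFactory">
  <!-- roughly: how many similarly sized segments accumulate before a merge -->
  <int name="mergeFactor">10</int>
</mergePolicyFactory>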

Customizing Merge Policies


An example:

<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
<str name="sort">timestamp desc</str>
<str name="wrapped.prefix">inner</str>
<str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
<int name="inner.maxMergeAtOnce">10</int>
<int name="inner.segmentsPerTier">10</int>
</mergePolicyFactory>

mergeScheduler

The merge scheduler controls how merges are performed


The default, ConcurrentMergeScheduler, performs merges in the background using multiple threads.
The alternative, SerialMergeScheduler, performs merges serially on the calling thread.

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>


mergedSegmentWarmer

Warming merged segments is useful for near-real-time search.

<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>

Compound File Segments

Compound file segments combine a segment's index files into a single file (controlled by useCompoundFile above).

Index Locks
lockType
The locking implementation to use. With StandardDirectoryFactory (the default), the options are:

native
simple
single
hdfs 

<lockType>native</lockType>


writeLockTimeout
The maximum time to wait for the write lock:
<writeLockTimeout>1000</writeLockTimeout>

Other Indexing Settings

The remaining parameters:
reopenReaders  
deletionPolicy
infoStream

Example:
<reopenReaders>true</reopenReaders>
<deletionPolicy class="solr.SolrDeletionPolicy">
<str name="maxCommitsToKeep">1</str>
<str name="maxOptimizedCommitsToKeep">0</str>
<str name="maxCommitAge">1DAY</str>
</deletionPolicy>
<infoStream>false</infoStream>


RequestHandlers and SearchComponents in SolrConfig


Request Handlers
SearchHandlers
UpdateRequestHandlers
ShardHandlers
Other Request Handlers
Search Components
Default Components
First-Components and Last-Components
Components
Other Useful Components

Request Handlers

A request handler maps a request path to the code that processes requests sent to that path.

SearchHandlers

Parameters and characteristics; a hedged example follows.
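
A sketch of a typical SearchHandler definition, written from memory of the stock example configs (the parameter values are illustrative):

<requestHandler name="/select" class="solr.SearchHandler">
  <!-- defaults apply unless the request overrides them -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
  </lst>
  <!-- appends and invariants lists may be added alongside defaults -->
</requestHandler>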

UpdateRequestHandlers

ShardHandlers

Other Request Handlers

Solr ships only a handful of request handler types; the two used most often are the search and update handlers.


Search Components

Search components define the logic that is used by the SearchHandler to perform queries for users.

They plug into the SearchHandler.

Default Components

Unless overridden with first-components and last-components, the default components run in this order:

query     - solr.QueryComponent        - described in Query Syntax and Parsing
facet     - solr.FacetComponent        - described in Faceting
mlt       - solr.MoreLikeThisComponent - described in MoreLikeThis
highlight - solr.HighlightComponent    - described in Highlighting
stats     - solr.StatsComponent        - described in The Stats Component
debug     - solr.DebugComponent        - described in Common Query Parameters
expand    - solr.ExpandComponent       - described in Collapse and Expand Results


A default component can be replaced by registering a search component with the same name, as sketched below.
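
For example (a sketch, not from the original notes), registering a component under one of the default names replaces the built-in instance:

<!-- redefining "facet" replaces the built-in facet component -->
<searchComponent name="facet" class="solr.FacetComponent"/>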

First-Components and Last-Components


<arr name="first-components">
<str>mycomponent</str>
</arr>
<arr name="last-components">
<str>spellcheck</str>
</arr>

Components

If you declare the components list explicitly (instead of using first-components/last-components), the default components are not loaded unless you list them yourself:

<arr name="components">
<str>mycomponent</str>
<str>query</str>
<str>debug</str>
</arr>


Other Useful Components

SpellCheckComponent, described in the section Spell Checking.
TermVectorComponent, described in the section The Term Vector Component.
QueryElevationComponent, described in the section The Query Elevation Component.
TermsComponent, described in the section The Terms Component.


InitParams in SolrConfig

An <initParams> section of solrconfig.xml allows you to define request handler parameters outside of the handler configuration.

<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
<lst name="defaults">
<str name="df">_text_</str>
</lst>
</initParams>

This applies shared default parameters to all of the listed handler paths.

If we later want to change the /query request handler to search a different field by default, we could override the <initParams> by defining the parameter in the <requestHandler> section for /query
An individual handler's own configuration can override these shared values.


Wildcards


Example:

<initParams name="myParams" path="/myhandler,/root/*,/root1/**">
<lst name="defaults">
<str name="fl">_text_</str>
</lst>
<lst name="invariants">
<str name="rows">10</str>
</lst>
<lst name="appends">
<str name="df">title</str>
</lst>
</initParams>

UpdateHandlers in SolrConfig
<updateHandler class="solr.DirectUpdateHandler2">
...
</updateHandler>


Topics covered in this section:
Commits
commit and softCommit
autoCommit
commitWithin
Event Listeners
Transaction Log


Commits

Data sent to Solr is not searchable until it has been committed to the index.

commit and softCommit

commit is a hard commit: data is flushed and synced to stable storage.
softCommit makes new documents visible quickly, enabling near-real-time search, but data not yet hard-committed can be lost if the machine crashes.
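
For example, an explicit commit can be requested on the update endpoint (the collection name is a placeholder); commit=true issues a hard commit and softCommit=true a soft one:

# hard commit: flush and sync index files
curl "http://localhost:8983/solr/mycollection/update?commit=true"
# soft commit: make recent documents searchable without syncing
curl "http://localhost:8983/solr/mycollection/update?softCommit=true"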

autoCommit

<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>1000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>


<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>


commitWithin

commitWithin forces a commit to happen within a specified amount of time; it is used most often with near-real-time indexing, and for that reason the default is to perform a soft commit. To change this:


<commitWithin>
<softCommit>false</softCommit>
</commitWithin>


With this configuration, when you call commitWithin as part of your update message, it will always perform a hard commit.


Event Listeners

These can be triggered to occur after any commit (event="postCommit") or only after optimize commands (event="postOptimize")

These are the two listener event configurations.
When an event fires, a listener class can run follow-up work; for example, RunExecutableListener runs an external executable and takes a handful of parameters (the executable, its working directory, whether to wait for it, its arguments, and its environment). A hedged sketch follows.
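
A sketch of a postCommit listener that runs an external snapshot script, modeled on the RunExecutableListener example in the reference guide (the executable, directory, and arguments are placeholders):

<listener event="postCommit" class="solr.RunExecutableListener">
  <!-- executable to run, found under the dir below -->
  <str name="exe">snapshooter</str>
  <str name="dir">solr/bin</str>
  <!-- block the commit until the executable finishes -->
  <bool name="wait">true</bool>
  <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
  <arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>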


Transaction Log

a transaction log is required for that feature. It is configured in the updateHandler section of solrconfig.xml.

<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>


Additional configuration parameters:
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
<int name="numRecordsToKeep">500</int>
<int name="maxNumLogsToKeep">20</int>
<int name="numVersionBuckets">65536</int>
</updateLog>



Query Settings in SolrConfig

The settings in this section affect the way that Solr will process and respond to queries

<query>
...
</query>


Topics covered in this section:
Caches
Query Sizing and Warming
Query-Related Listeners



Caches

Solr caches query inputs and results so that repeated queries can be answered from the cache, speeding up responses. When a new searcher is opened, its caches can be prepopulated ("warmed") from the caches of the previous searcher.
In Solr, there are three cache implementations: solr.search.LRUCache, solr.search.FastLRUCache, and solr.search.LFUCache.

filterCache

When a query uses the fq parameter, the filter and its set of matching documents are cached, so an identical filter in a later request can be answered straight from the cache.

<filterCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="128"/>


queryResultCache

This cache holds the results of previous searches: ordered lists of document IDs (DocList) based on a query, a sort, and the range of documents requested

<queryResultCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="128"
maxRamMB="1000"/>



documentCache
This cache holds Lucene Document objects (the stored fields for each document). 
Since Lucene internal document IDs are transient, this cache is not auto-warmed. 

<documentCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>



User Defined Caches

You can also define your own named caches:

<cache name="myUserCache" class="solr.LRUCache"
size="4096"
initialSize="1024"
autowarmCount="1024"
regenerator="org.mycompany.mypackage.MyRegenerator" />


For caches that need no warming, another value for the regenerator attribute is regenerator="solr.NoOpRegenerator".
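
For instance, the stock example configs declare a per-segment filter cache that is never regenerated, roughly like this (a sketch from memory):

<cache name="perSegFilter"
  class="solr.search.LRUCache"
  size="10"
  initialSize="0"
  autowarmCount="10"
  regenerator="solr.NoOpRegenerator" />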


Query Sizing and Warming


maxBooleanClauses

The maximum number of clauses allowed in a boolean query. This is effectively a global setting, so the value from the last core initialized wins:

<maxBooleanClauses>1024</maxBooleanClauses>



enableLazyFieldLoading


<enableLazyFieldLoading>true</enableLazyFieldLoading>


useFilterForSortedQuery

Useful when sorting by something other than score:

<useFilterForSortedQuery>true</useFilterForSortedQuery>

queryResultWindowSize

When a query result is cached, a superset of the requested documents (up to this window size) is cached so that nearby pages can be served from the cache:

<queryResultWindowSize>20</queryResultWindowSize>

queryResultMaxDocsCached


<queryResultMaxDocsCached>200</queryResultMaxDocsCached>


useColdSearcher

This setting controls whether search requests for which there is not a currently registered searcher should wait for a new searcher to warm up (false) or proceed immediately (true). When set to "false", requests will block until the searcher has warmed its caches.

<useColdSearcher>false</useColdSearcher>

maxWarmingSearchers

<maxWarmingSearchers>2</maxWarmingSearchers>

Query-Related Listeners

Two event types are supported:

<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<!--
<lst><str name="q">solr</str><str name="sort">price asc</str></lst>
<lst><str name="q">rocks</str><str name="sort">weight asc</str></lst>
-->
</arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst><str name="q">static firstSearcher warming in solrconfig.xml</str></lst>
</arr>
</listener>



RequestDispatcher in SolrConfig
Topics in this section:
handleSelect Element
requestParsers Element
httpCaching Element



handleSelect Element
Provided for backward compatibility.

<requestDispatcher handleSelect="true" >
...
</requestDispatcher>


requestParsers Element

The <requestParsers> sub-element controls values related to parsing requests. This is an empty XML element that doesn't have any content, only attributes.

Its main attributes:
<requestParsers enableRemoteStreaming="true"
multipartUploadLimitInKB="2048000"
formdataUploadLimitInKB="2048"
addHttpRequestToContext="false" />


httpCaching Element

<httpCaching never304="false"
lastModFrom="openTime"
etagSeed="Solr">
<cacheControl>max-age=30, public</cacheControl>
</httpCaching>


cacheControl Element




Update Request Processors

Anatomy and life cycle
Configuration
Update processors in SolrCloud
Using custom chains
Update Request Processor Factories



Anatomy and life cycle

Updates go through a default processor chain unless you configure a chain of your own.
Each processor is created by a processor factory, and two rules apply:
An update request processor need not be thread-safe, because it is used by one and only one request thread and is destroyed once the request is complete.
The factory class can accept configuration parameters and maintain any state that may be required between requests; the factory class must be thread-safe.


Configuration

Chains are configured in solrconfig.xml and loaded with the core; individual processors can also be selected at request time with request parameters.

A custom chain should be modeled on the default one, because some of its processors are essential.

The default update request processor chain

In order:
LogUpdateProcessorFactory - Tracks the commands processed during this request and logs them.
DistributedUpdateProcessorFactory - Responsible for distributing update requests to the right node, e.g. routing requests to the leader of the right shard and distributing updates from the leader to each replica. This processor is activated only in SolrCloud mode.
RunUpdateProcessorFactory - Executes the update using internal Solr APIs.



Custom update request processor chain

updateRequestProcessorChain
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Solr will automatically insert a DistributedUpdateProcessorFactory into any chain that does not include one, just prior to the RunUpdateProcessorFactory.



Configuring individual processors as top-level plugins

updateProcessor

<updateProcessor class="solr.processor.SignatureUpdateProcessorFactory"
name="signature">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</updateProcessor>
<updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory"
name="remove_blanks"/>


These top-level processors can then be referenced by name when building chains, combining updateProcessorChain definitions with the updateProcessor entries above:

<updateProcessorChain name="custom" processor="remove_blanks,signature">
<processor class="solr.RunUpdateProcessorFactory" />
</updateProcessorChain>

Update processors in SolrCloud

A critical SolrCloud functionality is the routing and distributing of requests – for update requests this routing is implemented by the DistributedUpdateRequestProcessor, and this processor is given a special status by Solr due to its important function.

In a chain, everything before the DistributedUpdateProcessor runs on the node that first receives the request; the DistributedUpdateProcessor then routes the update to the leader of the correct shard, which logs it and forwards it to its replicas for the remaining processing.

For example:

For example, consider the "dedupe" chain which we saw in a section above. Assume that a 3 node SolrCloud
cluster exists where node A hosts the leader of shard1, node B hosts the leader of shard2 and node C hosts the
replica of shard2. Assume that an update request is sent to node A which forwards the update to node B
(because the update belongs to shard2) which then distributes the update to its replica node C. Let's see what
happens at each node:
Node A: Runs the update through the SignatureUpdateProcessor (which computes the signature and puts
it in the "id" field), then LogUpdateProcessor and then DistributedUpdateProcessor. This processor
determines that the update actually belongs to node B and is forwarded to node B. The update is not
processed further. This is required because the next processor which is RunUpdateProcessor will execute
the update against the local shard1 index which would lead to duplicate data on shard1 and shard2.
Node B: Receives the update and sees that it was forwarded by another node. The update is sent directly to the DistributedUpdateProcessor, because it has already been through the SignatureUpdateProcessor on node A and repeating the signature computation would be redundant. The DistributedUpdateProcessor determines that the update indeed belongs to this node, distributes it to its replica on node C, and then forwards the update further along the chain to the RunUpdateProcessor.
Node C: Receives the update and sees that it was distributed by its leader. The update is directly sent to
DistributedUpdateProcessor which performs some consistency checks and forwards the update further in
the chain to RunUpdateProcessor.
In summary:
All processors before DistributedUpdateProcessor are only run on the first node that receives an update
request whether it be a forwarding node (e.g. node A in the above example) or a leader (e.g. node B). We
call these pre-processors or just processors.
All processors after DistributedUpdateProcessor run only on the leader and the replica nodes. They are
not executed on forwarding nodes. Such processors are called "post-processors".



post-processors

<updateProcessorChain name="custom" processor="signature"
post-processor="remove_blanks">
<processor class="solr.RunUpdateProcessorFactory" />
</updateProcessorChain>



Using custom chains

update.chain request parameter

The update.chain request parameter selects which update processor chain handles a request.

update.chain

curl "http://localhost:8983/solr/gettingstarted/update/json?update.chain=dedupe&commit=true" -H 'Content-type: application/json' -d '
[
{
"name" : "The Lightning Thief",
"features" : "This is just a test",
"cat" : ["book","hardcover"]
},
{
"name" : "The Lightning Thief",
"features" : "This is just a test",
"cat" : ["book","hardcover"]
}
]'



processor & post-processor request parameters

These two parameters let you construct a processing chain dynamically at request time.


Constructing a chain at request time


# Executing processors configured in solrconfig.xml as (pre-)processors
curl "http://localhost:8983/solr/gettingstarted/update/json?processor=remove_blanks,signature&commit=true" -H 'Content-type: application/json' -d '
[
{
"name" : "The Lightning Thief",
"features" : "This is just a test",
"cat" : ["book","hardcover"]
},
{
"name" : "The Lightning Thief",
"features" : "This is just a test",
"cat" : ["book","hardcover"]
}
]'
# Executing processors configured in solrconfig.xml as pre- and post-processors
curl "http://localhost:8983/solr/gettingstarted/update/json?processor=remove_blanks&post-processor=signature&commit=true" -H 'Content-type: application/json' -d '
[
{
"name" : "The Lightning Thief",
"features" : "This is just a test",
"cat" : ["book","hardcover"]
},
{
"name" : "The Lightning Thief",
"features" : "This is just a test",
"cat" : ["book","hardcover"]
}
]'


Configuring a custom chain as a default

Two ways to make a custom chain the default:

This can be done by adding either "update.chain" or "processor" and "post-processor" as default parameter for a
given path which can be done either via InitParams in SolrConfig or by adding them in a "defaults" section which
is supported by all request handlers.


Examples:
InitParams

<initParams path="/update/**">
<lst name="defaults">
<str name="update.chain">add-unknown-fields-to-the-schema</str>
</lst>
</initParams>



defaults

<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="update.chain">add-unknown-fields-to-the-schema</str>
</lst>
</requestHandler>



Update Request Processor Factories

The following factory classes are available; see the reference documentation for what each does:

AddSchemaFieldsUpdateProcessorFactory:
CloneFieldUpdateProcessorFactory:
DefaultValueUpdateProcessorFactory:
DocBasedVersionConstraintsProcessorFactory:
DocExpirationUpdateProcessorFactory:
IgnoreCommitOptimizeUpdateProcessorFactory: 
RegexpBoostProcessorFactory:
SignatureUpdateProcessorFactory:
StatelessScriptUpdateProcessorFactory: 
TimestampUpdateProcessorFactory: 
URLClassifyProcessorFactory: 
UUIDUpdateProcessorFactory: 


FieldMutatingUpdateProcessorFactory derived factories

ConcatFieldUpdateProcessorFactory
CountFieldValuesUpdateProcessorFactory
FieldLengthUpdateProcessorFactory
FirstFieldValueUpdateProcessorFactory
HTMLStripFieldUpdateProcessorFactory
IgnoreFieldUpdateProcessorFactory
LastFieldValueUpdateProcessorFactory
MaxFieldValueUpdateProcessorFactory
MinFieldValueUpdateProcessorFactory
ParseBooleanFieldUpdateProcessorFactory
ParseDateFieldUpdateProcessorFactory
ParseNumericFieldUpdateProcessorFactory derived classes
      ParseDoubleFieldUpdateProcessorFactory: Attempts to mutate selected fields that have only CharSequence-typed values into Double values.
     ParseFloatFieldUpdateProcessorFactory : Attempts to mutate selected fields that have only CharSequence-typed values into Float values.
      ParseIntFieldUpdateProcessorFactory : Attempts to mutate selected fields that have only  CharSequence-typed values into Integer values.
     ParseLongFieldUpdateProcessorFactory : Attempts to mutate selected fields that have only  CharSequence-typed values into Long values.


PreAnalyzedUpdateProcessorFactory
RegexReplaceProcessorFactory :
RemoveBlankFieldUpdateProcessorFactory :
TrimFieldUpdateProcessorFactory:
TruncateFieldUpdateProcessorFactory:
UniqFieldsUpdateProcessorFactory :


Update Processor factories that can be loaded as plugins

Several processor factories are shipped as plugins that can be loaded (and extended) separately:

LangDetectLanguageIdentifierUpdateProcessorFactory (based on the langdetect language-detection library)

TikaLanguageIdentifierUpdateProcessorFactory

UIMAUpdateRequestProcessorFactory

Update Processor factories you should not modify or remove

Do not casually modify or remove Solr's built-in update request processor factories.


Codec Factory

Defines the codec used to write the index to disk. It is configured in solrconfig.xml; if not defined, Solr uses the default codec.
A compressionMode option:
 BEST_SPEED (default) is optimized for search speed performance
 BEST_COMPRESSION is optimized for disk space usage


Example:
<codecFactory class="solr.SchemaCodecFactory">
<str name="compressionMode">BEST_COMPRESSION</str>
</codecFactory>

Solr Cores and solr.xml

In Solr, the term core is used to refer to a single index and associated transaction log and configuration files (including the solrconfig.xml and Schema files, among others). 

In standalone mode, solr.xml must reside in solr_home. In SolrCloud mode, solr.xml will be loaded from Zookeeper if it exists, with fallback to solr_home.

The recommended way is to dynamically create cores/collections using the APIs

The following sections describe these options in more detail.
Format of solr.xml: Details on how to define solr.xml, including the acceptable parameters for the solr.xml file
Defining core.properties: Details on placement of core.properties and available property options.
CoreAdmin API: Tools and commands for core administration using a REST API.
Config Sets: How to use configsets to avoid duplicating effort when defining a new core.


Format of solr.xml

This section will describe the default solr.xml file included with Solr and how to modify it for your needs. For details on how to configure core.properties, see the section Defining core.properties.

Defining solr.xml
Solr.xml Parameters
The <solr> Element
The <solrcloud> element
The <logging> element
The <logging><watcher> element
The <shardHandlerFactory> element
Substituting JVM System Properties in solr.xml


Defining solr.xml
You can find solr.xml in your Solr Home directory or in Zookeeper. The default solr.xml file looks like this:
<solr>
<solrcloud>
<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<int name="zkClientTimeout">${zkClientTimeout:15000}</int>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:0}</int>
<int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>
</solr>

Unless -DzkHost or -DzkRun is specified at startup, the <solrcloud> section is ignored.

Solr.xml Parameters

The <solr> Element

The attributes available on this element.

The <solrcloud> element

This section is ignored unless the solr instance is started with either -DzkRun or -DzkHost

Parameters for SolrCloud mode, including ZooKeeper access-control (credential) settings.

The <logging> element

The logging implementation class and whether logging is enabled.

The <logging><watcher> element

Configuration for the logging watcher.

The <shardHandlerFactory> element

Defines a custom shard handler:
Custom shard handlers can be defined in solr.xml if you wish to create a custom shard handler.
<shardHandlerFactory name="ShardHandlerFactory" class="qualified.class.name">
Since this is a custom shard handler, sub-elements are specific to the implementation.



Substituting JVM System Properties in solr.xml

JVM system properties can be referenced in solr.xml using ${propertyname[:option default value]}, optionally with a default value.


A property set on the JVM at startup overrides the configured default, as in this fragment (see the start command after the snippet):
<solr>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:0}</int>
<int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>
</solr>
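
So, for example, the socketTimeout default of 0 above could be overridden when starting Solr (the value is illustrative):

bin/solr start -DsocketTimeout=10000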


Defining core.properties

core.properties is an ordinary Java properties file, for example:
name=my_core_name

Placement of core.properties

core.properties lives in the core's directory under solr_home; Solr discovers cores by scanning for these files.

Defining core.properties Files

name

The name of the SolrCore. You'll use this name to reference the SolrCore when running
commands with the CoreAdminHandler

config

The configuration file name for a given core. The default is solrconfig.xml.

schema

The schema file name for a given core. The default is schema.xml but please note that if
you are using a "managed schema" (the default behavior) then any value for this property
which does not match the effective managedSchemaResourceName will be read once,
backed up, and converted for managed schema use.


dataDir

The core's data directory (where indexes are stored) as either an absolute pathname, or a
path relative to the value of instanceDir. This is data by default.

configSet

The name of a defined configset, if desired, to use to configure the core 

properties

The name of the properties file for this core. The value can be an absolute pathname or a
path relative to the value of instanceDir

transient

If true, the core can be unloaded if Solr reaches the transientCacheSize. The default
if not specified is false. Cores are unloaded in order of least recently used first. 
Setting to true is not recommended in SolrCloud mode.

loadOnStartup

If true (the default when not specified), the core is loaded when Solr starts. Setting this to false is not recommended in SolrCloud mode.

coreNodeName

Used only in SolrCloud, this is a unique identifier for the node hosting this replica. By default a coreNodeName is generated automatically, but setting this attribute explicitly allows you to manually assign a new core to replace an existing replica; for example, when replacing a machine that has had a hardware failure by restoring from backups on a new machine with a new hostname or port.


ulogDir

The absolute or relative directory for the update log for this core (SolrCloud)

shard

The shard to assign this core to (SolrCloud)

collection

The name of the collection this core is part of (SolrCloud).

roles

A future parameter for SolrCloud, or a way for users to mark nodes for their own use; its purpose is not well defined.

Additional "user defined" properties may be specified for use as variables. For more information on how to define local properties, see the section Substituting Properties in Solr Config Files.

These user-defined properties then become substitution variables, as described earlier.

CoreAdmin API (mainly for standalone mode)

SolrCloud users should not typically use the CoreAdmin API directly

In SolrCloud mode the CoreAdmin API is not normally used directly.

There is one CoreAdminHandler per Solr node; it manages the cores running in that node and is accessible at the /solr/admin/cores path.

The API is invoked via HTTP requests that specify an "action" request parameter, along with the parameters of that action.

All action names are uppercase and are described in depth in the sections below:

STATUS
CREATE
RELOAD
RENAME
SWAP
UNLOAD
MERGEINDEXES
SPLIT
REQUESTSTATUS


STATUS

The STATUS action returns the status of all running Solr cores, or status for only the named core.

http://localhost:8983/solr/admin/cores?action=STATUS&core=core0

Input

core - the name of the core to return status for.

indexInfo - whether to include index information (default true); when there are many cores, setting this to false returns faster.

CREATE

The CREATE action creates a new core and registers it.

If a Solr core with the given name already exists, it will continue to handle requests while the new core is initializing. When the new core is ready, it will take new requests and the old core will be unloaded.

If you CREATE a core whose name already exists, the old core keeps serving requests until the new one is ready, and is then unloaded and replaced.

http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path/to/dir&config=config_file_name.xml&dataDir=data


Input

name
instanceDir
config
schema
dataDir
configSet
collection
shard
property.name=value
async

Example
http://localhost:8983/solr/admin/cores?action=CREATE&name=my_core&collection=my_collection&shard=shard2

RELOAD

The RELOAD action loads a new core from the configuration of an existing, registered Solr core.

http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0

Input

core


RENAME

The RENAME action changes the name of a Solr core.
http://localhost:8983/solr/admin/cores?action=RENAME&core=core0&other=core5

Input

core

other

async


SWAP

SWAP atomically swaps the names of two existing Solr cores.

http://localhost:8983/solr/admin/cores?action=SWAP&core=core1&other=core0


UNLOAD
UNLOAD removes a core from Solr.
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core0

Input

MERGEINDEXES
Merges indexes.

The MERGEINDEXES action merges one or more indexes to another index.

http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=new_core_name&indexDir=/solr_home/core1/data/index&indexDir=/solr_home/core2/data/index

Alternatively, we can instead use a srcCore parameter, as in this example:
http://localhost:8983/solr/admin/cores?action=mergeindexes&core=new_core_name&srcCore=core1&srcCore=core2


SPLIT

The SPLIT action splits an index into two or more indexes. 

The SPLIT action supports five parameters, which are described in the table below

Input

core

path
Multi-valued; the directory path(s) in which a piece of the index will be written.

targetCore
Multi-valued; the target Solr core(s) into which a piece of the index will be merged.

ranges
A comma-separated list of hash ranges, in hexadecimal format (see the example below).

split.key

async

Examples
The core index will be split into as many pieces as the number of path or targetCore parameters.

Usage with two targetCore parameters:

http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&targetCore=core2


Usage with two path parameters:

http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&path=/path/to/index/1&path=/path/to/index/2

Usage with the split.key parameter:

http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&split.key=A!

Usage with ranges parameter:

http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&targetCore=core2&targetCore=core3&ranges=0-1f4,1f5-3e8,3e9-5dc


REQUESTSTATUS

Returns the status of an asynchronous request.

Input

requestid

http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=1

Config Sets

A way to share configuration files among multiple cores.


On a multicore Solr instance, you may find that you want to share configuration between a number of different cores. You can achieve this using named configsets, which are essentially shared configuration directories stored under a configurable configset base directory.
To create a configset, simply add a new directory under the configset base directory. The configset will be identified by the name of this directory. Then copy into it the conf directory you want to share. The structure should look something like this:

/<configSetBaseDir>
    /configset1
        /conf
            /managed-schema
            /solrconfig.xml
    /configset2
        /conf
            /managed-schema
            /solrconfig.xml

The default base directory is $SOLR_HOME/configsets, and it can be configured in solr.xml.
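
A sketch of overriding the base directory in solr.xml (the path is a placeholder; configSetBaseDir is the parameter for the <solr> element):

<solr>
  <!-- placeholder path for shared configsets -->
  <str name="configSetBaseDir">/var/solr/configsets</str>
  ...
</solr>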
To create a new core using a configset, pass configSet as one of the core properties. For example, if you do
this via the core admin API:

http://<solr>/admin/cores?action=CREATE&name=mycore&instanceDir=path/to/instance&configSet=configset2




Configuration APIs

Solr includes several APIs that can be used to modify settings in solrconfig.xml.

APIs for modifying solrconfig.xml:

Blob Store API
Config API
Request Parameters API
Managed Resources


Blob Store API

The Blob Store REST API provides REST methods to store, retrieve or list files in a Lucene index.

The blob store is only available when running in SolrCloud mode

The blob store API is implemented as a requestHandler. A special collection named ".system" must be created as the collection that contains the blob store index.

Create a .system Collection

You can create the .system collection with the Collections API, as in this example:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=.system&replicationFactor=2"



Upload Files to Blob Store

After the .system collection has been created, files can be uploaded to the blob store with a request similar to
the following:
curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @{filename}
http://localhost:8983/solr/.system/blob/{blobname}
For example, to upload a file named "test1.jar" as a blob named "test", you would make a POST request like:
curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @test1.jar
http://localhost:8983/solr/.system/blob/test
A GET request will return the list of blobs and other details:
curl http://localhost:8983/solr/.system/blob?omitHeader=true
Output:


{
"response":{"numFound":1,"start":0,"docs":[
{
"id":"test/1",
"md5":"20ff915fa3f5a5d66216081ae705c41b",
"blobName":"test",
"version":1,
"timestamp":"2015-02-04T16:45:48.374Z",
"size":13108}]
}
}
Details on individual blobs can be accessed with a request similar to:
curl http://localhost:8983/solr/.system/blob/{blobname}
For example, this request will return only the blob named 'test':
curl http://localhost:8983/solr/.system/blob/test?omitHeader=true
Output:
{
"response":{"numFound":1,"start":0,"docs":[
{
"id":"test/1",
"md5":"20ff915fa3f5a5d66216081ae705c41b",
"blobName":"test",
"version":1,
"timestamp":"2015-02-04T16:45:48.374Z",
"size":13108}]
}
}
The filestream response writer can return a particular version of a blob for download, as in:
curl http://localhost:8983/solr/.system/blob/{blobname}/{version}?wt=filestream >
{outputfilename}
For the latest version of a blob, the {version} can be omitted,
curl http://localhost:8983/solr/.system/blob/{blobname}?wt=filestream >
{outputfilename}


The above covers uploading, listing, and downloading blob files.

Use a Blob in a Handler or Component

To use the blob as the class for a request handler or search component, you create a request handler in solrconfig.xml as usual. You will need to define the following parameters:

class: the fully qualified class name. For example, if you created a new request handler class called CRUDHandler, you would enter org.apache.solr.core.CRUDHandler.

runtimeLib: Set to true to require that this component should be loaded from the classloader that loads the runtime jars.
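
A hedged sketch of wiring an uploaded blob into a handler through the Config API, using the add-runtimelib and add-requesthandler commands listed later in this chapter; the handler path is a placeholder and the exact payload shape is an assumption:

# register the uploaded blob (named "test", version 1) as a runtime library
curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-runtimelib": { "name":"test", "version":1 }}'
# register a handler whose class is loaded from the runtime classloader
curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-requesthandler": { "name":"/crud", "class":"org.apache.solr.core.CRUDHandler",
  "runtimeLib":true, "version":1 }}'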


Config API


This feature is enabled by default and works similarly in both SolrCloud and standalone mode. 

When using this API, solrconfig.xml is not changed. Instead, all edited configuration is stored in a file called configoverlay.json. The values in configoverlay.json override the values in solrconfig.xml.

API Entry Points
Commands
Commands for Common Properties
Commands for Custom Handlers and Local Components
Commands for User-Defined Properties
How to Map solrconfig.xml Properties to JSON
Examples
Creating and Updating Common Properties 
Creating and Updating Request Handlers
Creating and Updating User-Defined Properties
How It Works
Empty Command 
Listening to config Changes



API Entry Points

/config: retrieve or modify the config. GET to retrieve and POST for executing commands
/config/overlay: retrieve the details in the configoverlay.json alone
/config/params : allows creating parameter sets that can override or take the place of parameters defined in solrconfig.xml. 



Commands

The config commands are categorized into three sections, which manipulate various data structures in solrconfig.xml. Each of these is described below.
Common Properties
Components
User-defined properties


The common properties are those that frequently need to be customized in a Solr instance. They are manipulated with two commands:

set-property: Set a well known property. The names of the properties are predefined and fixed. If the property has already been set, this command will overwrite the previous setting.
unset-property: Remove a property set using the set-property command.


Commands for Custom Handlers and Local Components

Command names are case-insensitive; there are three kinds of actions: add, update, and delete.

The full list of available commands follows below:


General Purpose Commands

These commands are the most commonly used:
add-requesthandler
update-requesthandler
delete-requesthandler
add-searchcomponent
update-searchcomponent
delete-searchcomponent
add-initparams
update-initparams
delete-initparams
add-queryresponsewriter
update-queryresponsewriter
delete-queryresponsewriter



Advanced Commands

These commands allow registering more advanced customizations to Solr:
add-queryparser
update-queryparser
delete-queryparser
add-valuesourceparser
update-valuesourceparser
delete-valuesourceparser
add-transformer
update-transformer
delete-transformer
add-updateprocessor
update-updateprocessor
delete-updateprocessor
add-queryconverter
update-queryconverter
delete-queryconverter
add-listener
update-listener
delete-listener
add-runtimelib
update-runtimelib
delete-runtimelib


What about <updateRequestProcessorChain>?

The Config API does not let you create or edit <updateRequestProcessorChain> elements. However, it is possible to create <updateProcessor> entries and use them by name to build a chain.
example:
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"add-updateprocessor" : { "name" : "firstFld", 
"class": "solr.FirstFieldValueUpdateProcessorFactory", 
"fieldName":"test_s"}}'
You can then use it directly in a request by naming the specific update processor in the processor parameter, e.g. processor=firstFld.


Commands for User-Defined Properties

Solr lets users templatize solrconfig.xml using the placeholder format ${variable_name:default_val}. You could set the values using system properties, for example -Dvariable_name=my_customvalue.
The same can be achieved at runtime using these commands:
set-user-property: Set a user-defined property. If the property has already been set, this command will overwrite the previous setting.
unset-user-property: Remove a user-defined property.
The structure of the request is similar to requests using other commands, in the format "command":{"variable_name":"property_value"}. You can add more than one variable at a time if necessary.


In other words, these property values can be changed at runtime without restarting the JVM.


How to Map solrconfig.xml Properties to JSON

How the XML attributes and sub-elements map onto the JSON command payload.


Here is what a request handler looks like in solrconfig.xml:
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler> 
The same request handler defined with the Config API would look like this:


{
"add-requesthandler":{
"name":"/query",
"class":"solr.SearchHandler",
"defaults":{
"echoParams":"explicit",
"wt":"json",
"indent":true
}
}
}
A searchComponent in solrconfig.xml looks like this:
<searchComponent name="elevator" class="solr.QueryElevationComponent" >
<str name="queryFieldType">string</str>
<str name="config-file">elevate.xml</str>
</searchComponent>
And the same searchComponent with the Config API:
{
"add-searchcomponent":{
"name":"elevator",
"class":"QueryElevationComponent",
"queryFieldType":"string",
"config-file":"elevate.xml"
}
}
Set autoCommit properties in solrconfig.xml:
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
Define the same properties with the Config API:
{
"set-property": {
"updateHandler.autoCommit.maxTime":15000,
"updateHandler.autoCommit.openSearcher":false
}
}


Name Components for the Config API

Elements that have no name in solrconfig.xml must be given one so the Config API can address them; a hedged example follows.
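
For illustration (the payload shape here is an assumption, modeled on the other commands above): adding a listener through the Config API requires a name even though solrconfig.xml would not:

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
  "add-listener": { "name":"myCommitListener", "event":"postCommit",
  "class":"solr.RunExecutableListener" }}'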

Examples

Creating and Updating Common Properties 
This change sets query.filterCache.autowarmCount to 1000 items and unsets query.filterCache.size.
curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d'{
"set-property" : {"query.filterCache.autowarmCount":1000},
"unset-property" :"query.filterCache.size"}'
Using the /config/overlay endpoint, you can verify the changes with a request like this:
curl http://localhost:8983/solr/gettingstarted/config/overlay?omitHeader=true
And you should get a response like this:
{
"overlay":{
"znodeVersion":1,
"props":{"query":{"filterCache":{
"autowarmCount":1000,
"size":25}}}}}



Creating and Updating Request Handlers

To create a request handler, we can use the add-requesthandler command:
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"add-requesthandler" : {
"name": "/mypath",
"class":"solr.DumpRequestHandler",
"defaults":{ "x":"y" ,"a":"b", "wt":"json", "indent":true },
"useParams":"x"
},
}'
Make a call to the new request handler to check if it is registered:
curl http://localhost:8983/solr/techproducts/mypath?omitHeader=true
And you should see the following as output:

{
"params":{
"indent":"true",
"a":"b",
"x":"y",
"wt":"json"},
"context":{
"webapp":"/solr",
"path":"/mypath",
"httpMethod":"GET"}} 


To update a request handler, you should use the update-requesthandler command :
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"update-requesthandler": {
"name": "/mypath",
"class":"solr.DumpRequestHandler",
"defaults": { "x":"new value for X", "wt":"json", "indent":true },
"useParams":"x"
}
}'
As another example, we'll create another request handler, this time adding the 'terms' component as part of the
definition:
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"add-requesthandler": {
"name": "/myterms",
"class":"solr.SearchHandler",
"defaults": { "terms":true, "distrib":false },
"components": [ "terms" ]
}
}'


Creating and Updating User-Defined Properties

This command sets a user-defined property.
curl http://localhost:8983/solr/techproducts/config
-H'Content-type:application/json' -d '{
"set-user-property" : {"variable_name":"some_value"}}'
Again, we can use the /config/overlay endpoint to verify the changes have been made:
curl http://localhost:8983/solr/techproducts/config/overlay?omitHeader=true
And we would expect to see output like this

{"overlay":{
"znodeVersion":5,
"userProps":{
"variable_name":"some_value"}}
}
To unset the variable, issue a command like this:
curl http://localhost:8983/solr/techproducts/config
-H'Content-type:application/json' -d '{"unset-user-property" : "variable_name"}'



How It Works

Every core watches the ZooKeeper directory for the configset being used with that core. In standalone mode,
however, there is no watch (because ZooKeeper is not running). If there are multiple cores in the same node
using the same configset, only one ZooKeeper watch is used. For instance, if the configset 'myconf' is used by a
core, the node would watch /configs/myconf. Every write operation performed through the API would 'touch'
the directory (sets an empty byte[] to trigger watches) and all watchers are notified. Every core would check if the
Schema file, solrconfig.xml or configoverlay.json is modified by comparing the znode versions and if
modified, the core is reloaded.
If params.json is modified, the params object is just updated without a core reload



In short: when the configuration changes, the ZooKeeper watch fires, each core checks what changed, and reloads itself automatically if needed.

Empty Command

If an empty command is sent to the /config endpoint, the watch is triggered on all cores using this configset.
For example:
curl http://localhost:8983/solr/techproducts/config
-H'Content-type:application/json' -d '{}'
Directly editing any files without 'touching' the directory will not make it visible to all nodes.
It is possible for components to watch for the configset 'touch' events by registering a listener using SolrCore#r
egisterConfListener() .


An empty command simply triggers the watch.

Listening to config Changes

Any component can register a listener using:
SolrCore#addConfListener(Runnable listener)
to get notified of config changes. This is not very useful if the modified files result in core reloads (i.e., configoverlay.json or the schema). Components can use this to reload the files they are interested in.


That is how a listener is registered.


Request Parameters API

The Request Parameters API allows creating parameter sets that can override or take the place of parameters defined in solrconfig.xml.

In this case, the parameters are stored in a file named params.json. This file is
kept in ZooKeeper or in the conf directory of a standalone Solr instance.

The settings stored in params.json are used at query time to override settings defined in solrconfig.xml in some cases as described below.

When might you want to use this feature?
To avoid frequently editing your solrconfig.xml to update request parameters that change often.
To reuse parameters across various request handlers.
To mix and match parameter sets at request time.
To avoid a reload of your collection for small parameter changes.


The Request Parameters Endpoint
All requests are sent to the /config/params endpoint of the Config API.

Setting Request Parameters

The request to set, unset, or update request parameters is sent as a set of Maps with names. These objects can
be directly used in a request or a request handler definition.
The available commands are:
set: Create or overwrite a parameter set map.
unset: delete a parameter set map.
update: update a parameter set map. This is equivalent to a map.putAll(newMap) . Both the maps are
merged and if the new map has same keys as old they are overwritten
You can mix these commands into a single request if necessary.


Each map must include a name so it can be referenced later, either in a direct request to Solr or in a request handler definition.
In the following example, we are setting 2 sets of parameters named 'myFacets' and 'myQueries'.
curl http://localhost:8983/solr/techproducts/config/params -H
'Content-type:application/json' -d '{
"set":{
"myFacets":{
"facet":"true",
"facet.limit":5}},
"set":{
"myQueries":{
"defType":"edismax",
"rows":"5",
"df":"text_all"}}
}'


In the above example all the parameters are equivalent to the "defaults" in solrconfig.xml. It is possible to add invariants and appends as follows

curl http://localhost:8983/solr/techproducts/config/params -H
'Content-type:application/json' -d '{
"set":{
"my_handler_params":{
"facet.limit":5, 
"_invariants_": {
"facet":true,
"wt":"json"
},
"_appends_":{"facet.field":["field1","field2"]
}
}}
}'


now it is possible to define a request handler as follows
<requestHandler name="/my_handler" class="solr.SearchHandler"
useParams="my_handler_params"/>
It will be equivalent to a requesthandler definition as follows,
<requestHandler name="/my_handler" class="solr.SearchHandler">
<lst name="defaults">
<int name="facet.limit">5</int>
</lst>
<lst name="invariants">
<str name="wt">json</str>
<bool name="facet">true</bool>
</lst>
<lst name="appends">
<arr name="facet.field">
<str>field1</str>
<str>field2</str>
</arr>
</lst>
</requestHandler>
Update example,
curl http://localhost:8983/solr/techproducts/config/params -H
'Content-type:application/json' -d '{
"update":{
"myFacets":{
"facet.limit":10}},
}'
This command will add (or replace) the facet.limit param in the myFacets map, keeping all other existing myFacets params.
To see the parameters that have been set, you can use the /config/params endpoint to read the contents of params.json, or use the name in the request:


curl http://localhost:8983/solr/techproducts/config/params
#Or use the params name
curl http://localhost:8983/solr/techproducts/config/params/myQueries


The useParams Parameter
When making a request, the useParams parameter applies the request parameters set to the request. This is
translated at request time to the actual params. 
For example (using the names we set up in the earlier example, please replace with your own name):
http://localhost/solr/techproducts/select?useParams=myQueries
It is possible to pass more than one parameter set in the same request. For example:
http://localhost/solr/techproducts/select?useParams=myFacets,myQueries
In the above example the param set 'myQueries' is applied on top of 'myFacets'. So, values in 'myQueries' take precedence over values in 'myFacets'. Additionally, any values passed in the request take precedence over 'useParams' params. This acts like the "defaults" specified in the '<requestHandler>' definition in solrconfig.xml.
The parameter sets can be used directly in a request handler definition as follows. Please note that the 'useParams' specified is always applied even if the request contains useParams.
<requestHandler name="/terms" class="solr.SearchHandler" useParams="myQueries">
<lst name="defaults">
<bool name="terms">true</bool>
<bool name="distrib">false</bool>
</lst>
<arr name="components">
<str>terms</str>
</arr>
</requestHandler>


That is how the defined parameter sets are used in requests.



To summarize, parameters are applied in this order:
parameters defined in <invariants> in solrconfig.xml.
parameters applied in _invariants_ in params.json and that is specified in the requesthandler definition or
even in request
parameters defined in the request directly. 
parameter sets defined in the request, in the order they have been listed with useParams.
parameter sets defined in params.json that have been defined in the request handler.
parameters defined in <defaults> in solrconfig.xml.



Public APIs
Accessing request parameters from Java code:
The RequestParams Object can be accessed using the method SolrConfig#getRequestParams(). Each
paramset can be accessed by their name using the method RequestParams#getRequestParams(String
name).


Managed Resources

Several kinds of resources (such as stop words and synonyms) can be managed through a REST API.

All of the examples in this section assume you are running the "techproducts" Solr example:
bin/solr -e techproducts

Overview

Let's begin learning about managed resources by looking at a couple of examples provided by Solr for managing stop words and synonyms using a REST API. After reading this section, you'll be ready to dig into the details of how managed resources are implemented in Solr so you can start building your own implementation.

Stop words

To begin, you need to define a field type that uses the ManagedStopFilterFactory , such as:
<fieldType name="managed_en" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory"
managed="english" />
</analyzer>
</fieldType>


There are two important things to notice about this field type definition. First, the filter implementation class is solr.ManagedStopFilterFactory. This is a special implementation of the StopFilterFactory that uses a set of stop words that are managed from a REST API. Second, the managed="english" attribute gives a name to the set of managed stop words, in this case indicating the stop words are for English text.
The REST endpoint for managing the English stop words in the techproducts collection is: /solr/techproducts/schema/analysis/stopwords/english.
The example resource path should be mostly self-explanatory. It should be noted that the
ManagedStopFilterFactory implementation determines the /schema/analysis/stopwords part of the path, which
makes sense because this is an analysis component defined by the schema. It follows that a field type that uses
the following filter:
<filter class="solr.ManagedStopFilterFactory" 
managed="french" />
would resolve to path: /solr/techproducts/schema/analysis/stopwords/french.
So now let’s see this API in action, starting with a simple GET request:


curl "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"
Assuming you sent this request to Solr, the response body is a JSON document:
{
"responseHeader":{
"status":0,
"QTime":1
},
"wordSet":{
"initArgs":{"ignoreCase":true},
"initializedOn":"2014-03-28T20:53:53.058Z",
"managedList":[
"a",
"an",
"and",
"are",
... ]
}
}
The sample_techproducts_configs config set ships with a pre-built set of managed stop words, however
you should only interact with this file using the API and not edit it directly.
One thing that should stand out to you in this response is that it contains a managedList of words as well as initArgs. This is an important concept in this framework: managed resources typically have configuration and data. For stop words, the only configuration parameter is a boolean that determines whether to ignore the case of tokens during stop word filtering (ignoreCase=true|false). The data is a list of words, which is represented as a JSON array named managedList in the response.
Now, let’s add a new word to the English stop word list using an HTTP PUT:
curl -X PUT -H 'Content-type:application/json' --data-binary '["foo"]'
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"
Here we’re using cURL to PUT a JSON list containing a single word “foo” to the managed English stop words
set. Solr will return 200 if the request was successful. You can also put multiple words in a single PUT request.
You can test to see if a specific word exists by sending a GET request for that word as a child resource of the
set, such as:
curl "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english/foo"
This request will return a status code of 200 if the child resource (foo) exists, or 404 if it does not exist in the managed list.
To delete a stop word, you would do:
curl -X DELETE
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english/foo"
Note: PUT/POST is used to add terms to an existing list instead of replacing the list entirely. This is because it is more common to add a term to an existing list than it is to replace a list altogether, so the API favors the more common approach of incrementally adding terms, especially since deleting individual terms is also supported.



CRUD operations on stop words

Synonyms

For the most part, the API for managing synonyms behaves similar to the API for stop words, except instead of
working with a list of words, it uses a map, where the value for each entry in the map is a set of synonyms for a
term. As with stop words, the sample_techproducts_configs config set includes a pre-built set of synonym
mappings suitable for the sample data that is activated by the following field type definition in schema.xml:
<fieldType name="managed_en" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory"
managed="english" />
<filter class="solr.ManagedSynonymFilterFactory"
managed="english" />
</analyzer>
</fieldType>


Synonyms work much like stop words, except that they are stored as a map rather than a list.

To get the map of managed synonyms, send a GET request to:
curl "http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"
This request will return a response that looks like:
{
"responseHeader":{
"status":0,
"QTime":3},
"synonymMappings":{
"initArgs":{
"ignoreCase":true,
"format":"solr"},
"initializedOn":"2014-12-16T22:44:05.33Z",
"managedMap":{
"GB":
["GiB",
"Gigabyte"],
"TV":
["Television"],
"happy":
["glad",
"joyful"]}}}


Managed synonyms are returned under the managedMap property, which contains a JSON map where the value of each entry is a set of synonyms for a term; for example, "happy" has the synonyms "glad" and "joyful" in the example above.
To add a new synonym mapping, you can PUT/POST a single mapping such as:
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"mad":["angry","upset"]}'
"http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"
The API will return status code 200 if the PUT request was successful. To determine the synonyms for a specific
term, you send a GET request for the child resource, such as /schema/analysis/synonyms/english/mad


would return ["angry","upset"]. 
You can also PUT a list of symmetric synonyms, which will be expanded into a mapping for each term in the list.
For example, you could PUT the following list of symmetric synonyms using the JSON list syntax instead of a
map:
curl -X PUT -H 'Content-type:application/json' --data-binary '["funny",
"entertaining", "whimiscal", "jocular"]'
"http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"
Note that the expansion is performed when processing the PUT request, so the underlying persistent state is still a managed map. Consequently, if after sending the previous PUT request you did a GET for /schema/analysis/synonyms/english/jocular, then you would receive a list containing ["funny", "entertaining", "whimiscal"]. Once you've created synonym mappings using a list, each term must be managed separately.
Lastly, you can delete a mapping by sending a DELETE request to the managed endpoint.


CRUD operations on synonyms

Applying Changes

Changes made to managed resources via this REST API are not applied to the active Solr components until the Solr collection (or Solr core in single-server mode) is reloaded. For example, after adding or deleting a stop word, you must reload the core/collection before changes become active.


Changes only take effect after a reload.

This approach is required when running in distributed mode so that we are assured changes are applied to all
cores in a collection at the same time so that behavior is consistent and predictable. It goes without saying that
you don’t want one of your replicas working with a different set of stop words or synonyms than the others.
One subtle outcome of this apply-changes-at-reload approach is that once you make changes with the API, there is no way to read the active data. In other words, the API returns the most up-to-date data from an API perspective, which could be different than what is currently being used by Solr components. However, the intent of this API implementation is that changes will be applied using a reload within a short time frame after making them, so the time in which the data returned by the API differs from what is active in the server is intended to be negligible.

The correct workflow: apply changes, reload, and re-index if needed.


RestManager Endpoint

Metadata about registered ManagedResources is available using the /schema/managed and /config/managed
endpoints for each collection. Assuming you have the managed_en field type shown above defined in your
schema.xml, sending a GET request to the following resource will return metadata about which schema-related
resources are being managed by the RestManager:
curl "http://localhost:8983/solr/techproducts/schema/managed"
The response body is a JSON document containing metadata about managed resources under
the /schema root:

{
"responseHeader":{
"status":0,
"QTime":3
},
"managedResources":[
{
"resourceId":"/schema/analysis/stopwords/english",
"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource",
"numObservers":"1"
},
{
"resourceId":"/schema/analysis/synonyms/english",
"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory$SynonymMan
ager",
"numObservers":"1"
}
]
}
You can also create a new managed resource using PUT/POST to the appropriate URL – before ever configuring anything that uses these resources.
For example: imagine we want to build up a set of German stop words. Before we can start adding stop words,
we need to create the endpoint:
/solr/techproducts/schema/analysis/stopwords/german
To create this endpoint, send the following PUT/POST request to the endpoint we wish to create:
curl -X PUT -H 'Content-type:application/json' --data-binary \
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"
Solr will respond with status code 200 if the request is successful. Effectively, this action registers a new
endpoint for a managed resource in the RestManager. From here you can start adding German stop words as
we saw above:
curl -X PUT -H 'Content-type:application/json' --data-binary '["die"]' \
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"
For most users, creating resources in this way should never be necessary, since managed resources are created
automatically when configured.
However: You may want to explicitly delete managed resources if they are no longer being used by a Solr
component.
For instance, the managed resource for German that we created above can be deleted because there are no
Solr components that are using it, whereas the managed resource for English stop words cannot be deleted
because there is a token filter declared in schema.xml that is using it.
curl -X DELETE
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"


You can create a managed resource via the API and delete it again, but a resource that is still referenced in schema.xml cannot be deleted.


Solr Plugins


Solr allows you to load custom code to perform a variety of tasks within Solr, from custom Request Handlers to
process your searches, to custom Analyzers and Token Filters for your text field. You can even load custom
Field Types. These pieces of custom code are called plugins.
Not everyone will need to create plugins for their Solr instances - what's provided is usually enough for most
applications. However, if there's something that you need, you may want to review the Solr Wiki documentation
on plugins at SolrPlugins.
If you have a plugin you would like to use, and you are running in SolrCloud mode, you can use the Blob Store
API and the Config API to load the jars to Solr. The commands to use are described in the section Adding
Custom Plugins in SolrCloud Mode. 


Solr supports custom plugins if you need them.

Adding Custom Plugins in SolrCloud Mode

When running Solr in SolrCloud mode and you want to use custom code (such as custom analyzers, tokenizers,
query parsers, and other plugins), it can be cumbersome to add jars to the classpath on all nodes in your cluster.
Using the Blob Store API and special commands with the Config API, you can upload jars to a special system-level collection and dynamically load plugins from them at runtime without needing to restart any nodes.


Uploading jars to SolrCloud via the API is more convenient than copying them to every node.

This Feature is Disabled By Default
In addition to requiring that Solr be running in SolrCloud mode, this feature is also disabled by default unless all Solr nodes are run with the -Denable.runtime.lib=true option on startup.
Before enabling this feature, users should carefully consider the issues discussed in the Securing
Runtime Libraries section below.


This feature is disabled by default; it must be enabled with a startup option.

Uploading Jar Files

The first step is to use the Blob Store API to upload your jar files. This will put your jars in the .system collection and distribute them across your SolrCloud nodes. These jars are added to a separate classloader and are only accessible to components that are configured with the property runtimeLib=true. These components are loaded lazily because the .system collection may not be loaded when a particular core is loaded.


Uploading the jars
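As a rough sketch of this step, assuming the .system collection already exists and using an invented jar and blob name (myplugin.jar / myplugin):

# Upload the jar as a blob named "myplugin"; each upload of the same name creates a new version
curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @myplugin.jar \
  "http://localhost:8983/solr/.system/blob/myplugin"
# List what has been stored under that blob name
curl "http://localhost:8983/solr/.system/blob/myplugin?omitHeader=true"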

Config API Commands to use Jars as Runtime Libraries

The runtime library feature uses a special set of commands for the Config API to add, update, or remove jar files
currently available in the blob store to the list of runtime libraries.
The following commands are used to manage runtime libs:
add-runtimelib
update-runtimelib
delete-runtimelib


curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
"add-runtimelib": { "name":"jarblobname", "version":2 },
"update-runtimelib": { "name":"jarblobname", "version":3 },
"delete-runtimelib": "jarblobname"
}'


Use these commands to manage the jars.

The name to use is the name of the blob that you specified when you uploaded your jar to the blob store. You
should also include the version of the jar found in the blob store that you want to use. These details are added to 
configoverlay.json. 
The default SolrResourceLoader does not have visibility to the jars that have been defined as runtime
libraries. There is a classloader that can access these jars which is made available only to those components
which are specially annotated.
Every pluggable component can have an optional extra attribute called runtimeLib=true, which means that
the components are not loaded at core load time. Instead, they will be loaded on demand. If all the dependent
jars are not available when the component is loaded, an error is thrown.
This example shows creating a ValueSourceParser using a jar that has been loaded to the Blob store.
curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
"create-valuesourceparser": {
"name": "nvl",
"runtimeLib": true,
"class": "solr.org.apache.solr.search.function.NvlValueSourceParser",
"nvlFloatValue": 0.0 }
}'


With runtimeLib=true set, the component is loaded from the runtime libraries on demand.

Securing Runtime Libraries

A drawback of this feature is that it could be used to load malicious executable code into the system. However, it
is possible to restrict the system to load only trusted jars using PKI to verify that the executables loaded into the
system are trustworthy.
The following steps will allow you to enable security for this feature. The instructions assume you have started all your Solr nodes with the -Denable.runtime.lib=true option.


The drawback of this feature is that malicious jars could be loaded into the system; the following steps show how to load jars securely.

Step 1: Generate an RSA Private Key

The first step is to generate an RSA private key. The example below uses a 512-bit key, but you should use the
strength appropriate to your needs. 
$ openssl genrsa -out priv_key.pem 512


Generate a private key using the RSA algorithm.

Step 2: Output the Public Key
The public portion of the key should be output in DER format so Java can read it.
$ openssl rsa -in priv_key.pem -pubout -outform DER -out pub_key.der


The public key is output in DER format so that Java can read it.

Step 3: Load the Key to ZooKeeper

The .der files that are output from Step 2 should then be loaded to ZooKeeper under a node /keys/exe so they are available throughout every node. You can load any number of public keys to that node and all are valid. If a key is removed from the directory, the signatures of that key will cease to be valid. So, before removing a key, make sure to update your runtime library configurations with valid signatures using the update-runtimelib command.
At the current time, you can only use the ZooKeeper zkCli.sh (or zkCli.cmd on Windows) script to issue
these commands (the Solr version has the same name, but is not the same). If you are running the embedded
ZooKeeper that is included with Solr, you do not have this script already; in order to use it, you will need to
download a copy of ZooKeeper v3.4.6 from http://zookeeper.apache.org/. Don't worry about configuring the
download, you're just trying to get the command line utility script. When you start the script, you will connect to
the embedded ZooKeeper. If you have your own ZooKeeper ensemble running already, you can find the script
in $ZK_INSTALL/bin/zkCli.sh (or zkCli.cmd if you are using Windows).
To load the keys, you will need to connect to ZooKeeper with zkCli.sh, create the directories, and then create
the key file, as in the following example.
# Connect to ZooKeeper
# Replace the server location below with the correct ZooKeeper connect string for your installation.
$ ./bin/zkCli.sh -server localhost:9983
# After connection, you will interact with the ZK prompt.
# Create the directories
[zk: localhost:9983(CONNECTED) 5] create /keys
[zk: localhost:9983(CONNECTED) 5] create /keys/exe
# Now create the public key file in ZooKeeper
# The second path is the path to the .der file on your local machine
[zk: localhost:9983(CONNECTED) 5] create /keys/exe/pub_key.der
/myLocal/pathTo/pub_key.der
After this, any attempt to load a jar will fail. All your jars must be signed with one of your private keys for Solr to
trust it. The process to sign your jars and use the signature is outlined in Steps 4-6.



Use the ZooKeeper CLI to create the required znodes and upload the public key to the expected location.


Step 4: Sign the jar File
Next you need to sign the sha1 digest of your jar file and get the base64 string. 
$ openssl dgst -sha1 -sign priv_key.pem myjar.jar | openssl enc -base64
The output of this step will be a string that you will need when you add the jar to your classpath in Step 6 below.


Generate a signature for your jar and save it.


Step 5: Load the jar to the Blob Store
Load your jar to the Blob store, using the Blob Store API. This step does not require a signature; you will need
the signature in Step 6 to add it to your classpath.
curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @{filename} 
http://localhost:8983/solr/.system/blob/{blobname}
The blob name that you give the jar file in this step will be used as the name in the next step.


Upload the jar to the .system collection; this step is no different from a normal Blob Store upload.

Step 6: Add the jar to the Classpath

Finally, add the jar to the classpath using the Config API as detailed above. In this step, you will need to provide the signature of the jar that you got in Step 4.

curl http://localhost:8983/solr/techproducts/config -H 'Content-type:application/json' -d '{
"add-runtimelib": {
"name":"blobname",
"version":2,
"sig":"mW1Gwtz2QazjfVdrLFHfbGwcr8xzFYgUOLu68LHqWRDvLG0uLcy1McQ+AzVmeZFBf1yLPDEHBWJb5KXr8bdbHN/PYgUB1nsr9pk4EFyD9KfJ8TqeH/ijQ9waa/vjqyiKEI9U550EtSzruLVZ32wJ7smvV0fj2YYhrUaaPzOn9g0=" }
}'



Use the signature obtained in Step 4 to add the uploaded jar to the classpath so it can be used.

JVM Settings

Configuring your JVM can be a complex topic. A full discussion is beyond the scope of this document. Luckily,
most modern JVMs are quite good at making the best use of available resources with default settings. The
following sections contain a few tips that may be helpful when the defaults are not optimal for your situation.
For more general information about improving Solr performance, see https://wiki.apache.org/solr/SolrPerformanceFactors.

The default JVM settings are already good; tune them only if you have special requirements.

Choosing Memory Heap Settings

The two primary JVM heap parameters are -Xms, which sets the initial size of the JVM's memory heap, and -Xmx, which sets the maximum size to which the heap is allowed to grow.


Heap sizing also interacts with JVM garbage collection and I/O performance, which should be taken into account.
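A small sketch of how the heap could be set when starting Solr; the 2g value is only an example, not a recommendation:

# Set both -Xms and -Xmx to 2 GB via the start script
bin/solr start -m 2g
# Or pass the raw JVM flags through the include file (see the production setup section)
SOLR_JAVA_MEM="-Xms2g -Xmx2g"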

Use the Server HotSpot VM

If you are using Sun's JVM, add the -server command-line option when you start Solr. This tells the JVM that it should optimize for a long running, server process. If the Java runtime on your system is a JRE, rather than a full JDK distribution (including javac and other development tools), then it is possible that it may not support the -server JVM option. Test this by running java -help and looking for -server as an available option in the displayed usage message.


Use the -server option when running Solr.

Checking JVM Settings

A great way to see what JVM settings your server is using, along with other useful information, is to use the
admin RequestHandler, solr/admin/system. This request handler will display a wealth of server statistics and
settings.
You can also use any of the tools that are compatible with the Java Management Extensions (JMX). See the
section Using JMX with Solr in Managing Solr for more information.

How to check the JVM settings Solr is running with.
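For example, assuming the techproducts example is running, a request like the following should return a jvm section listing the heap sizes and JVM arguments in use (the exact output may vary by version):

curl "http://localhost:8983/solr/techproducts/admin/system?wt=json"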

Managing Solr

This section describes how to run Solr and how to look at Solr when it is running. It contains the following
sections:
Taking Solr to Production: Describes how to install Solr as a service on Linux for production environments.
Securing Solr: How to use the Basic and Kerberos authentication and rule-based authorization plugins for Solr,
and how to enable SSL.
Running Solr on HDFS: How to use HDFS to store your Solr indexes and transaction logs.
Making and Restoring Backups of SolrCores: Describes backup strategies for your Solr indexes.
Configuring Logging: Describes how to configure logging for Solr.
Using JMX with Solr: Describes how to use Java Management Extensions with Solr.
MBean Request Handler: How to use Solr's MBeans for programmatic access to the system plugins and stats.


As you can see, this part covers quite a lot of ground.
Taking Solr to Production

Running Solr in production.
This section provides guidance on how to setup Solr to run in production on *nix platforms, such as Ubuntu.
Specifically, we’ll walk through the process of setting up to run a single Solr instance on a Linux host and then
provide tips on how to support multiple Solr nodes running on the same host.
Service Installation Script
Planning your directory structure
Solr Installation Directory
Separate Directory for Writable Files
Create the Solr user
Run the Solr Installation Script
Solr Home Directory
Environment overrides include file
Log settings
init.d script
Progress Check
Fine tune your production setup
Memory and GC Settings
Out-of-Memory Shutdown Hook
SolrCloud
ZooKeeper chroot
Solr Hostname
Override settings in solrconfig.xml
Enable Remote JMX Access
Running multiple Solr nodes per host



Service Installation Script

Solr includes a service installation script (bin/install_solr_service.sh) to help you install Solr as a service on Linux. Currently, the script only supports Red Hat, Ubuntu, Debian, and SUSE Linux distributions.
Before running the script, you need to determine a few parameters about your setup. Specifically, you need to
decide where to install Solr and which system user should be the owner of the Solr files and process.


Use bin/install_solr_service.sh to quickly install Solr as a service.

Planning your directory structure

We recommend separating your live Solr files, such as logs and index files, from the files included in the Solr
distribution bundle, as that makes it easier to upgrade Solr and is considered a good practice to follow as a
system administrator.


Recommendations for planning the Solr directory structure.

Solr Installation Directory
By default, the service installation script will extract the distribution archive into /opt. You can change this
location using the -i option when running the installation script. The script will also create a symbolic link to the
versioned directory of Solr. For instance, if you run the installation script for Solr X.0.0, then the following
directory structure will be used:
/opt/solr-X.0.0
/opt/solr -> /opt/solr-X.0.0
Using a symbolic link insulates any scripts from being dependent on the specific Solr version. If, down the road,
you need to upgrade to a later version of Solr, you can just update the symbolic link to point to the upgraded
version of Solr. We’ll use /opt/solr to refer to the Solr installation directory in the remaining sections of this
page.


By default the installation creates a symbolic link; when upgrading later, you only need to point the link at the new Solr directory.

Separate Directory for Writable Files

You should also separate writable Solr files into a different directory; by default, the installation script uses /var
/solr, but you can override this location using the -d option. With this approach, the files in /opt/solr will
remain untouched and all files that change while Solr is running will live under /var/solr.


The default installation location and the default location for writable data (both can be overridden).

Create the Solr user

Running Solr as root is not recommended for security reasons. Consequently, you should determine the
username of a system user that will own all of the Solr files and the running Solr process. By default, the
installation script will create the solr user, but you can override this setting using the -u option. If your
organization has specific requirements for creating new user accounts, then you should create the user before
running the script. The installation script will make the Solr user the owner of the /opt/solr and /var/solr directories.
You are now ready to run the installation script.


For security reasons, Solr should run as a non-root user; by default the script creates a solr user, but you can specify a different one.

Run the Solr Installation Script

To run the script, you'll need to download the latest Solr distribution archive and then do the following (NOTE:
replace solr-X.Y.Z with the actual version number):
$ tar xzf solr-X.Y.Z.tgz solr-X.Y.Z/bin/install_solr_service.sh --strip-components=2
The previous command extracts the install_solr_service.sh script from the archive into the current directory. If installing on Red Hat, please make sure lsof is installed before running the Solr installation script (sudo yum install lsof). The installation script must be run as root:
$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz
By default, the script extracts the distribution archive into /opt, configures Solr to write files into /var/solr,

and runs Solr as the solr user. Consequently, the following command produces the same result as the previous
command:
$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
You can customize the service name, installation directories, port, and owner using options passed to the
installation script. To see available options, simply do:
$ sudo bash ./install_solr_service.sh -help
Once the script completes, Solr will be installed as a service and running in the background on your server (on
port 8983). To verify, you can do:
$ sudo service solr status
We'll cover some additional configuration settings you can make to fine-tune your Solr setup in a moment. Before
moving on, let's take a closer look at the steps performed by the installation script. This gives you a better
overview and will help you understand important details about your Solr installation when reading other pages in
this guide; such as when a page refers to Solr home, you'll know exactly where that is on your system.


Some examples of the installation command and how to check the service status; you can also review the available options and pick values that suit your environment.

My own attempt:
solr-6.0.1/bin/install_solr_service.sh solr-6.0.1.zip -i /usr/local/  -d /zyy/solr

This uses the defaults for everything else.
$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
This is the fully spelled-out default command.


Solr Home Directory

The Solr home directory (not to be confused with the Solr installation directory) is where Solr manages core
directories with index files. By default, the installation script uses /var/solr/data. If the -d option is used on
the install script, then this will change to the data subdirectory in the location given to the -d option. Take a
moment to inspect the contents of the Solr home directory on your system. If you do not store solr.xml in
ZooKeeper, the home directory must contain a solr.xml file. When Solr starts up, the Solr start script passes
the location of the home directory using the -Dsolr.solr.home system property.


The Solr home is passed to Solr via the -Dsolr.solr.home system property.
A solr.xml must exist, either in ZooKeeper or in the Solr home directory.
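For reference, the same property can also be set by hand through the start script; a small sketch with an illustrative path:

# Start Solr with an explicit Solr home; the script passes it on as -Dsolr.solr.home
bin/solr start -s /var/solr/data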


Environment overrides include file

The service installation script creates an environment specific include file that overrides defaults used by the bin/solr script. The main advantage of using an include file is that it provides a single location where all of your environment-specific overrides are defined. Take a moment to inspect the contents of the /etc/default/solr.in.sh file, which is the default path setup by the installation script. If you used the -s option on the install script to change the name of the service, then the first part of the filename will be different. For a service named solr-demo, the file will be named /etc/default/solr-demo.in.sh. There are many settings that you can override using this file. However, at a minimum, this script needs to define the SOLR_PID_DIR and SOLR_HOME
variables, such as:

SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data

The SOLR_PID_DIR variable sets the directory where the start script will write out a file containing the Solr server’s process ID. 



The default include file is /etc/default/solr.in.sh; it holds the startup settings, and the two required variables mentioned above live in this script. Together with the init.d script (in my install, /usr/local/solr/bin/init.d/solr), this is what allows Solr to be started as a service. (I will check whether the later sections cover this; if not, I will summarize it myself.)

Log settings
Solr uses Apache Log4J for logging. The installation script copies /opt/solr/server/resources/log4j.properties to /var/solr/log4j.properties and customizes it for your environment. Specifically, it updates the Log4J settings to create logs in the /var/solr/logs directory. Take a moment to verify that the Solr include file is configured to send logs to the correct location by checking the following settings in /etc/default/solr.in.sh:

LOG4J_PROPS=/var/solr/log4j.properties
SOLR_LOGS_DIR=/var/solr/logs

For more information about Log4J configuration, please see: Configuring Logging


How the Log4J properties file and the log file location are configured.

init.d script

When running a service like Solr on Linux, it’s common to setup an init.d script so that system administrators can control Solr using the service tool, such as: service solr start. The installation script creates a very basic init.d script to help you get started. Take a moment to inspect the /etc/init.d/solr file, which is the default script name setup by the installation script. If you used the -s option on the install script to change the name of the service, then the filename will be different. Notice that the following variables are setup for your environment based on the parameters passed to the installation script:

/etc/init.d/solr is the script that gets invoked by service solr start; this is the important one.

SOLR_INSTALL_DIR=/opt/solr
SOLR_ENV=/etc/default/solr.in.sh
RUNAS=solr

These three variables mean:
the Solr installation directory (used to locate and invoke Solr)
the environment override file for Solr
the user that runs Solr

The SOLR_INSTALL_DIR and SOLR_ENV variables should be self-explanatory. The RUNAS variable sets the
owner of the Solr process, such as solr; if you don’t set this value, the script will run Solr as root, which is not
recommended for production. You can use the /etc/init.d/solr script to start Solr by doing the following as root:
# service solr start

The /etc/init.d/solr script also supports the stop, restart, and status commands. Please keep in mind
that the init script that ships with Solr is very basic and is intended to show you how to setup Solr as a service.
However, it’s also common to use more advanced tools like supervisord or upstart to control Solr as a service
on Linux. While showing how to integrate Solr with tools like supervisord is beyond the scope of this guide, the init.d/solr script should provide enough guidance to help you get started. Also, the installation script sets the Solr service to start automatically when the host machine initializes.



Progress Check

In the next section, we cover some additional environment settings to help you fine-tune your production setup.
However, before we move on, let's review what we've achieved thus far. Specifically, you should be able to
control Solr using /etc/init.d/solr. Please verify the following commands work with your setup:
$ sudo service solr restart
$ sudo service solr status
The status command should give some basic information about the running Solr node that looks similar to:

Solr process PID running on port 8983
{
"version":"5.0.0 - ubuntu - 2014-12-17 19:36:58",
"startTime":"2014-12-19T19:25:46.853Z",
"uptime":"0 days, 0 hours, 0 minutes, 8 seconds",
"memory":"85.4 MB (%17.4) of 490.7 MB"}
If the status command is not successful, look for error messages in /var/solr/logs/solr.log.



Fine tune your production setup

Memory and GC Settings

By default, the bin/solr script sets the maximum Java heap size to 512M (-Xmx512m), which is fine for getting
started with Solr. For production, you’ll want to increase the maximum heap size based on the memory
requirements of your search application; values between 10 and 20 gigabytes are not uncommon for production
servers. When you need to change the memory settings for your Solr server, use the SOLR_JAVA_MEM variable
in the include file, such as:
SOLR_JAVA_MEM="-Xms10g -Xmx10g"
Also, the include file comes with a set of pre-configured Java Garbage Collection settings that have shown to
work well with Solr for a number of different workloads. However, these settings may not work well for your
specific use of Solr. Consequently, you may need to change the GC settings, which should also be done with the
GC_TUNE variable in the /etc/default/solr.in.sh include file. For more information about tuning your
memory and garbage collection settings, see: JVM Settings.


Memory and GC settings.
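Purely as an illustration, a GC_TUNE override in /etc/default/solr.in.sh might look like the following; the flags and values are examples only and should be validated against your own workload:

GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250"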


Out-of-Memory Shutdown Hook

The bin/solr script registers the bin/oom_solr.sh script to be called by the JVM if an OutOfMemoryError occurs. The oom_solr.sh script will issue a kill -9 to the Solr process that experiences the OutOfMemoryError. This behavior is recommended when running in SolrCloud mode so that ZooKeeper is immediately notified that a node has experienced a non-recoverable error. Take a moment to inspect the contents of the /opt/solr/bin/oom_solr.sh script so that you are familiar with the actions the script will perform if it is invoked by the JVM.


When an OutOfMemoryError occurs, Solr calls oom_solr.sh to kill the affected Solr process.

SolrCloud

To run Solr in SolrCloud mode, you need to set the ZK_HOST variable in the include file to point to your
ZooKeeper ensemble. Running the embedded ZooKeeper is not supported in production environments. For
instance, if you have a ZooKeeper ensemble hosted on the following three hosts on the default client port 2181
(zk1, zk2, and zk3), then you would set:



ZK_HOST=zk1,zk2,zk3

When the ZK_HOST variable is set, Solr will launch in "cloud" mode.


To run in SolrCloud mode, set the ZK_HOST variable; once it is set, Solr automatically starts in cloud mode.

ZooKeeper chroot

If you're using a ZooKeeper instance that is shared by other systems, it's recommended to isolate the SolrCloud
znode tree using ZooKeeper's chroot support. For instance, to ensure all znodes created by SolrCloud are stored
under /solr, you can put /solr on the end of your ZK_HOST connection string, such as:
ZK_HOST=zk1,zk2,zk3/solr
Before using a chroot for the first time, you need to create the root path (znode) in ZooKeeper by using the zkcli.sh script. We can use the makepath command for that:
$ server/scripts/cloud-scripts/zkcli.sh -zkhost zk1,zk2,zk3 -cmd makepath /solr

If you also want to bootstrap ZooKeeper with an existing solr_home, you can instead use zkcli.sh / zkcli.bat's bootstrap command, which will also create the chroot path if it does not exist. See Command Line Utilities for more info.

If you share a ZooKeeper ensemble with other systems, it is recommended to give SolrCloud its own chroot in ZooKeeper as shown above; you must create that znode yourself.

Solr Hostname

Use the SOLR_HOST variable in the include file to set the hostname of the Solr server.

SOLR_HOST=solr1.example.com

Setting the hostname of the Solr server is recommended, especially when running in SolrCloud mode, as this
determines the address of the node when it registers with ZooKeeper.


In SolrCloud mode, setting SOLR_HOST is recommended.

Override settings in solrconfig.xml

Solr allows configuration properties to be overridden using Java system properties passed at startup using the -Dproperty=value syntax. For instance, in solrconfig.xml, the default auto soft commit settings are set to:
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
In general, whenever you see a property in a Solr configuration file that uses the ${solr.PROPERTY:DEFAULT_VALUE} syntax, then you know it can be overridden using a Java system property. For instance, to set the maxTime for soft-commits to 10 seconds, you can start Solr with -Dsolr.autoSoftCommit.maxTime=10000, such as:
$ bin/solr start -Dsolr.autoSoftCommit.maxTime=10000
The bin/solr script simply passes options starting with -D on to the JVM during startup. For running in production, we recommend setting these properties in the SOLR_OPTS variable defined in the include file. Keeping with our soft-commit example, in /etc/default/solr.in.sh, you would do:
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"

An example of how to set a system property in the include file:
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"

Enable Remote JMX Access

If you need to attach a JMX-enabled Java profiling tool, such as JConsole or VisualVM, to a remote Solr server,
then you need to enable remote JMX access when starting the Solr server. Simply change the ENABLE_REMOTE_JMX_OPTS property in the include file to true. You'll also need to choose a port for the JMX RMI connector to bind to, such as 18983. For example, if your Solr include script sets:
Example settings:

ENABLE_REMOTE_JMX_OPTS=true
RMI_PORT=18983

The JMX RMI connector will allow Java profiling tools to attach to port 18983. When enabled, the following
properties are passed to the JVM when starting Solr:
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.local.only=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.port=18983 \
-Dcom.sun.management.jmxremote.rmi.port=18983
We don’t recommend enabling remote JMX access in production, but it can sometimes be useful when doing performance and user-acceptance testing prior to going into production.


How to enable remote JMX access (not recommended in production).

Running multiple Solr nodes per host

The bin/solr script is capable of running multiple instances on one machine, but for a typical installation, this
is not a recommended setup. Extra CPU and memory resources are required for each additional instance. A
single instance is easily capable of handling multiple indexes.


When to ignore the recommendation
For every recommendation, there are exceptions, particularly when discussing extreme scalability. The
best reason for running multiple Solr nodes on one host is decreasing the need for extremely large
heaps.
When the Java heap gets very large, it can result in extremely long garbage collection pauses, even with
the GC tuning that the startup script provides by default. The exact point at which the heap is
considered "very large" will vary depending on how Solr is used. This means that there is no hard
number that can be given as a threshold, but if your heap is reaching the neighborhood of 16 to 32
gigabytes, it might be time to consider splitting nodes. Ideally this would mean more machines, but
budget constraints might make that impossible.
There is another issue once the heap reaches 32GB. Below 32GB, Java is able to use compressed
pointers, but above that point, larger pointers are required, which uses more memory and slows down
the JVM.
Because of the potential garbage collection issues and the particular issues that happen at 32GB, if a
single instance would require a 64GB heap, performance is likely to improve greatly if the machine is set
up with two nodes that each have a 31GB heap.


Running multiple Solr instances on one machine is generally not recommended; the main exception is avoiding very large heaps and the garbage collection pauses they cause.

If your use case requires multiple instances, at a minimum you will need unique Solr home directories for each
node you want to run; ideally, each home should be on a different physical disk so that multiple Solr nodes don’t
have to compete with each other when accessing files on disk. Having different Solr home directories implies that
you’ll need a different include file for each node. Moreover, if using the /etc/init.d/solr script to control
Solr as a service, then you’ll need a separate script for each node. The easiest approach is to use the service
installation script to add multiple services on the same host, such as:


$ sudo bash ./install_solr_service.sh solr-X.Y.Z.tgz -s solr2 -p 8984

The command shown above will add a service named solr2 running on port 8984 using /var/solr2 for
writable (aka "live") files; the second server will still be owned and run by the solr user and will use the Solr
distribution files in /opt. After installing the solr2 service, verify it works correctly by doing:
$ sudo service solr2 restart
$ sudo service solr2 status


If you really must run multiple instances, you can handle it with a command like the one above.

This effectively sets up a second installation as a service; we could write our own script to automate such deployments.
.... Once I have finished studying the Linux material, I plan to write automated deployment scripts for both cluster and single-node setups.

Securing Solr

When planning how to secure Solr, you should consider which of the available features or approaches are right
for you.
Authentication or authorization of users using:
Kerberos Authentication Plugin
Basic Authentication Plugin
Rule-Based Authorization Plugin
Custom authentication or authorization plugin
Enabling SSL
If using SolrCloud, ZooKeeper Access Control


This part is mainly about securing Solr: authentication and authorization.

I have written about this before; here it is enough to look at access control for a standalone instance.
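As a very rough sketch of the Basic Authentication and Rule-Based Authorization plugins listed above, a security.json might look like the following; the user name, role, and credential placeholder are illustrative (see the Basic Authentication Plugin section for the exact hash/salt format). In SolrCloud the file is uploaded to ZooKeeper with zkcli.sh:

{
  "authentication": {
    "class": "solr.BasicAuthPlugin",
    "credentials": { "solr": "<base64 sha256 hash> <base64 salt>" }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "user-role": { "solr": "admin" },
    "permissions": [ { "name": "security-edit", "role": "admin" } ]
  }
}

# Upload to ZooKeeper (SolrCloud); the ZooKeeper address is illustrative
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd putfile /security.json security.json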





Kerberos Authentication Plugin


A Kerberos setup still requires a separate server acting as a ticket-granting authority (a KDC).





SolrCloud

Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high
availability. Called SolrCloud, these capabilities provide distributed indexing and search capabilities, supporting
the following features:
Central configuration for the entire cluster
Automatic load balancing and fail-over for queries
ZooKeeper integration for cluster coordination and configuration.


In this section, we'll cover everything you need to know about using Solr in SolrCloud mode. We've split up the
details into the following topics:
Getting Started with SolrCloud
How SolrCloud Works
Shards and Indexing Data in SolrCloud
Distributed Requests
Read and Write Side Fault Tolerance
SolrCloud Configuration and Parameters
Setting Up an External ZooKeeper Ensemble
Using ZooKeeper to Manage Configuration Files
ZooKeeper Access Control
Collections API
Parameter Reference
Command Line Utilities
SolrCloud with Legacy Configuration Files
ConfigSets API
Rule-based Replica Placement
Cross Data Center Replication (CDCR)




Getting Started with SolrCloud

SolrCloud is designed to provide a highly available, fault tolerant environment for distributing your indexed
content and query requests across multiple servers. It's a system in which data is organized into multiple pieces,
or shards, that can be hosted on multiple machines, with replicas providing redundancy for both scalability and
fault tolerance, and a ZooKeeper server that helps manage the overall structure so that both indexing and search requests can be routed properly.


This section explains SolrCloud and its inner workings in detail, but before you dive in, it's best to have an idea of
what it is you're trying to accomplish. This page provides a simple tutorial to start Solr in SolrCloud mode, so you
can begin to get a sense for how shards interact with each other during indexing and when serving queries. To that end, we'll use simple examples of configuring SolrCloud on a single machine, which is obviously not a real production environment, which would include several servers or virtual machines. In a real production environment, you'll also use the real machine names instead of "localhost" which we've used here.

In this section you will learn how to start a SolrCloud cluster using startup scripts and a specific configset.


SolrCloud Example

Interactive Startup
The bin/solr script makes it easy to get started with SolrCloud as it walks you through the process of
launching Solr nodes in cloud mode and adding a collection. To get started, simply do:
$ bin/solr -e cloud
This starts an interactive session to walk you through the steps of setting up a simple SolrCloud cluster with
embedded ZooKeeper. The script starts by asking you how many Solr nodes you want to run in your local
cluster, with the default being 2.
Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local
workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify
1-4 nodes) [2]
The script supports starting up to 4 nodes, but we recommend using the default of 2 when starting out. These
nodes will each exist on a single machine, but will use different ports to mimic operation on different servers.
Next, the script will prompt you for the port to bind each of the Solr nodes to, such as:
Please enter the port for node1 [8983]
Choose any available port for each node; the default for the first node is 8983 and 7574 for the second
node. The script will start each node in order and shows you the command it uses to start the server, such as:
solr start -cloud -s example/cloud/node1/solr -p 8983
The first node will also start an embedded ZooKeeper server bound to port 9983. The Solr home for the first
node is in example/cloud/node1/solr as indicated by the -s option.
After starting up all nodes in the cluster, the script prompts you for the name of the collection to create:
Please provide a name for your new collection: [gettingstarted]
The suggested default is "gettingstarted" but you might want to choose a name more appropriate for your
specific search application.
Next, the script prompts you for the number of shards to distribute the collection across. Sharding is covered in
more detail later on, so if you're unsure, we suggest using the default of 2 so that you can see how a collection is
distributed across multiple nodes in a SolrCloud cluster.
Next, the script will prompt you for the number of replicas to create for each shard. Replication is covered in
more detail later in the guide, so if you're unsure, then use the default of 2 so that you can see how replication is
handled in SolrCloud.
Lastly, the script will prompt you for the name of a configuration directory for your collection. You can choose basic_configs, data_driven_schema_configs, or sample_techproducts_configs. The configuration directories are pulled from server/solr/configsets/ so you can review them beforehand if you wish. The data_driven_schema_configs configuration (the default) is useful when you're still designing a schema for your documents and need some flexibility as you experiment with Solr.
At this point, you should have a new collection created in your local SolrCloud cluster. To verify this, you can run
the status command:
$ bin/solr status
If you encounter any errors during this process, check the Solr log files in example/cloud/node1/logs and example/cloud/node2/logs.
You can see how your collection is deployed across the cluster by visiting the cloud panel in the Solr Admin UI: http://localhost:8983/solr/#/~cloud. Solr also provides a way to perform basic diagnostics for a collection using the healthcheck command:
healthcheck command:
$ bin/solr healthcheck -c gettingstarted
The healthcheck command gathers basic information about each replica in a collection, such as number of docs,
current status (active, down, etc), and address (where the replica lives in the cluster).
Documents can now be added to SolrCloud using the Post Tool.
To stop Solr in SolrCloud mode, you would use the bin/solr script and issue the stop command, as in:
$ bin/solr stop -all


An example of getting started with SolrCloud mode.

Starting with -noprompt
Start SolrCloud with all default values.

You can also get SolrCloud started with all the defaults instead of the interactive session using the following
command:
$ bin/solr -e cloud -noprompt


Restarting Nodes

Restarting cluster nodes.

You can restart your SolrCloud nodes using the bin/solr script. For instance, to restart node1 running on port
8983 (with an embedded ZooKeeper server), you would do:
$ bin/solr restart -c -p 8983 -s example/cloud/node1/solr
To restart node2 running on port 7574, you can do:
$ bin/solr restart -c -p 7574 -z localhost:9983 -s example/cloud/node2/solr
Notice that you need to specify the ZooKeeper address (-z localhost:9983) when starting node2 so that it can join
the cluster with node1.


Adding a node to a cluster

Adding a new node to a SolrCloud cluster.

Adding a node to an existing cluster is a bit advanced and involves a little more understanding of Solr. Once you
startup a SolrCloud cluster using the startup scripts, you can add a new node to it by:
$ mkdir <solr.home for new solr node>
$ cp <existing solr.xml path> <new solr.home>
$ bin/solr start -cloud -s solr.home/solr -p <port num> -z <zk hosts string>
Notice that the above requires you to create a Solr home directory. You either need to copy solr.xml to the solr_home directory, or keep it centrally in ZooKeeper as /solr.xml.
Example (with directory structure) that adds a node to an example started with "bin/solr -e cloud":
$ mkdir -p example/cloud/node3/solr
$ cp server/solr/solr.xml example/cloud/node3/solr
$ bin/solr start -cloud -s example/cloud/node3/solr -p 8987 -z localhost:9983
The previous command will start another Solr node on port 8987 with Solr home set to example/cloud/node3/solr. The new node will write its log files to example/cloud/node3/logs.
Once you're comfortable with how the SolrCloud example works, we recommend using the process described in
Taking Solr to Production for setting up SolrCloud nodes in production.



How SolrCloud Works

The following sections provide general information about how various SolrCloud features work. To understand these features, it's important to first understand a few key concepts that relate to SolrCloud.
Shards and Indexing Data in SolrCloud
Distributed Requests
Read and Write Side Fault Tolerance
If you are already familiar with SolrCloud concepts and basic functionality, you can skip to the section covering SolrCloud Configuration and Parameters.



Key SolrCloud Concepts

A SolrCloud cluster consists of some "logical" concepts layered on top of some "physical" concepts.

Logical
A Cluster can host multiple Collections of Solr Documents.
A collection can be partitioned into multiple Shards, which contain a subset of the Documents in the Collection.
The number of Shards that a Collection has determines:
The theoretical limit to the number of Documents that Collection can reasonably contain.
The amount of parallelization that is possible for an individual search request.


Physical

A Cluster is made up of one or more Solr Nodes, which are running instances of the Solr server process.
Each Node can host multiple Cores.
Each Core in a Cluster is a physical Replica for a logical Shard.
Every Replica uses the same configuration specified for the Collection that it is a part of.
The number of Replicas that each Shard has determines:

The level of redundancy built into the Collection and how fault tolerant the Cluster can be in the event that some Nodes become unavailable.
The theoretical limit to the number of concurrent search requests that can be processed under heavy load.
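To make these concepts concrete, a hedged example of creating a collection with 2 shards and 2 replicas per shard using the bin/solr script; the collection name is invented:

# Creates "mycollection" spread over 2 shards with 2 replicas each, using the data_driven_schema_configs configset
bin/solr create -c mycollection -d data_driven_schema_configs -shards 2 -replicationFactor 2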


Shards and Indexing Data in SolrCloud

When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index.
A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard
for data that represents each state, or different categories that are likely to be searched independently, but are
often combined.
Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple
shards, so the query was executed against the entire Solr index and no documents would be missed from the
search results. So splitting the core across shards is not exclusively a SolrCloud concept. There were, however,
several problems with the distributed approach that necessitated improvement with SolrCloud:
Splitting of the core into shards was somewhat manual.
There was no support for distributed indexing, which meant that you needed to explicitly send documents
to a specific shard; Solr couldn't figure out on its own what shards to send documents to.
There was no load balancing or failover, so if you got a high number of queries, you needed to figure out
where to send them and if one shard died it was just gone.
SolrCloud fixes all those problems. There is support for distributing both the index process and the queries
automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can also have
multiple replicas for additional robustness.
In SolrCloud there are no masters or slaves. Instead, there are leaders and replicas. Leaders are automatically elected, initially on a first-come-first-served basis, and then based on the ZooKeeper leader election process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.
If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard ID.
When a document is sent to a machine for indexing, the system first determines if the machine is a replica or a leader.
If the machine is a replica, the document is forwarded to the leader for processing.
If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for this shard, and forwards the index notation to itself and any replicas.


Why shards and replicas are needed, and how they work.

Document Routing

Solr offers the ability to specify the router implementation used by a collection by specifying the router.name parameter when creating your collection. If you use the "compositeId" router, you can send documents with a prefix in the document ID which will be used to calculate the hash Solr uses to determine the shard a document is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with the ID "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is critical here, as it distinguishes the prefix used to determine which shard to direct the document to.

Then at query time, you include the prefix(es) into your query with the _route_ parameter (i.e., q=solr&_route_=IBM!) to direct queries to specific shards. In some situations, this may improve query performance because it overcomes network latency when querying all the shards.

The compositeId router supports prefixes containing up to 2 levels of routing. For example: a prefix routing
first by region, then by customer: "USA!IBM!12345"
Another use case could be if the customer "IBM" has a lot of documents and you want to spread it across
multiple shards. The syntax for such a use case would be : "shard_key/num!document_id" where the /num is the
number of bits from the shard key to use in the composite hash.
So "IBM/3!12345" will take 3 bits from the shard key and 29 bits from the unique doc id, spreading the tenant
over 1/8th of the shards in the collection. Likewise if the num value was 2 it would spread the documents across
1/4th the number of shards. At query time, you include the prefix(es) along with the number of bits into your
query with the _route_ parameter (i.e., q=solr&_route_=IBM/3!) to direct queries to specific shards.
If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID.
If you created the collection and defined the "implicit" router at the time of creation, you can additionally define a router.field parameter to use a field from each document to identify a shard where the document belongs. If the field specified is missing in the document, however, the document will be rejected. You could also use the _route_ parameter to name a specific shard.


Document routing rules when sharding, and how to use the routing parameters.
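A hedged sketch of the compositeId routing described above; the collection name "customers", the document values, and the configset name are invented, and the configset is assumed to already be in ZooKeeper:

# Create a collection that uses the compositeId router (the default when numShards is given)
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=customers&numShards=2&replicationFactor=1&router.name=compositeId&collection.configName=data_driven_schema_configs"
# Index a document whose id carries the routing prefix
curl "http://localhost:8983/solr/customers/update?commit=true" -H 'Content-type:application/json' -d '[{"id":"IBM!12345","customer_s":"IBM"}]'
# Query only the shard(s) that hold documents with that prefix
curl 'http://localhost:8983/solr/customers/select?q=*:*&_route_=IBM!'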

Shard Splitting

When you create a collection in SolrCloud, you decide on the initial number of shards to be used. But it can be difficult to know in advance the number of shards that you need, particularly when organizational requirements can change at a moment's notice, and the cost of finding out later that you chose wrong can be high, involving creating new cores and re-indexing all of your data.
The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The
existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can
delete the old shard at a later time when you're ready.
More details on how to use shard splitting is in the section on the Collections API.


Shard splitting lets you increase the number of shards later, so you are not locked into the count you chose at collection creation time.
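A minimal sketch of the corresponding Collections API call; the collection and shard names are taken from the gettingstarted example:

# Split shard1 of the "gettingstarted" collection into two sub-shards
curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=gettingstarted&shard=shard1"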



Ignoring Commits from Client Applications in SolrCloud

In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit
requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to
make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in
the cluster. To enforce a policy where client applications should not send explicit commits, you should update all
client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commits and/or optimize requests from client applications without having to refactor your client application code. To activate this request processor you'll need to add the following to your solrconfig.xml:


In SolrCloud you should avoid explicit commits from clients and rely on auto commits plus soft commits instead. Since this cannot always be enforced on the client side, you can configure solrconfig.xml as follows to ignore explicit commit and optimize requests.

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
<processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
<int name="statusCode">200</int>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.DistributedUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


As shown in the example above, the processor will return 200 to the client but will ignore the commit / optimize request. Notice that you need to wire-in the implicit processors needed by SolrCloud as well, since this custom chain is taking the place of the default chain.
In the following example, the processor will raise an exception with a 403 code with a customized error message:
You can also have Solr raise an exception when a commit or optimize command is sent:
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
<processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
<int name="statusCode">403</int>
<str name="responseMessage">Thou shall not issue a commit!</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.DistributedUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


Lastly, you can also configure it to just ignore optimize and let commits pass thru by doing:

It can also be configured to ignore only optimize operations and let commit commands pass through:

<updateRequestProcessorChain name="ignore-optimize-only-from-client-403">
<processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
<str name="responseMessage">Thou shall not issue an optimize, but commits are
OK!</str>
<bool name="ignoreOptimizeOnly">true</bool>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>



Distributed Requests

When a Solr node receives a search request, that request is routed behind the scenes to a replica of some
shard that is part of the collection being searched. The chosen replica will act as an aggregator: creating internal
requests to randomly chosen replicas of every shard in the collection, coordinating the responses, issuing any
subsequent internal requests as needed (for example, to refine facet values or request additional stored fields),
and constructing the final response for the client.

Limiting Which Shards are Queried

Querying only a specified subset of shards.

One of the advantages of using SolrCloud is the ability to have very large collections distributed among various shards
– but in some cases you may know that you are only interested in results from a subset of your shards. You
have the option of searching over all of your data or just parts of it.
Querying all shards for a collection should look familiar; it's as though SolrCloud didn't even come into play:

http://localhost:8983/solr/gettingstarted/select?q=*:*
If, on the other hand, you wanted to search just one shard, you can specify that shard by its logical ID, as in:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1
If you want to search a group of shard IDs, you can specify them together:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1,shard2
In both of the above examples, the shard ID(s) will be used to pick a random replica of that shard.
Alternatively, you can specify the explicit replicas you wish to use in place of a shard ID:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted,localhost:8983/solr/gettingstarted
Or you can specify a list of replicas to choose from for a single shard (for load balancing purposes) by using the
pipe symbol (|):
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted
And of course, you can specify a list of shards (separated by commas) each defined by a list of replicas
(separated by pipes). In this example, 2 shards are queried, the first being a random replica from shard1, the
second being a random replica from the explicit pipe-delimited list:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1,localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted



Configuring the ShardHandlerFactory

You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr.
This allows for finer grained control and you can tune it to target your own specific requirements. The default
configuration favors throughput over latency.
To configure the standard handler, provide a configuration like this in the solrconfig.xml:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<!-- other params go here -->
<shardHandler class="HttpShardHandlerFactory">
<int name="socketTimeOut">1000</int>
<int name="connTimeOut">5000</int>
</shardHandler>
</requestHandler>


How to configure the shard handler and its parameters.


Configuring statsCache (Distributed IDF)
Document and term statistics are needed in order to calculate relevancy. Solr provides four implementations out
of the box when it comes to document stats calculation:
LocalStatsCache: This only uses local term and document statistics to compute relevance. In cases with uniform term distribution across shards, this works reasonably well. This option is the default if no <statsCache> is configured.
ExactStatsCache: This implementation uses global values (across the collection) for document frequency.
ExactSharedStatsCache: This is exactly like the exact stats cache in its functionality, but the global stats are reused for subsequent requests with the same terms.
LRUStatsCache: This implementation uses an LRU cache to hold global stats, which are shared between requests.
The implementation can be selected by setting <statsCache> in solrconfig.xml. For example, the
following line makes Solr use the ExactStatsCache implementation:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>


Document and term statistics, and the available statsCache implementations.


Avoiding Distributed Deadlock

Each shard serves top-level query requests and then makes sub-requests to all of the other shards. Care should
be taken to ensure that the max number of threads serving HTTP requests is greater than the possible number
of requests from both top-level clients and other shards. If this is not the case, the configuration may result in a
distributed deadlock.
For example, a deadlock might occur in the case of two shards, each with just a single thread to service HTTP
requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other.
Because there are no more remaining threads to service requests, the incoming requests will be blocked until the
other pending requests are finished, but they will not finish since they are waiting for the sub-requests. By
ensuring that Solr is configured to handle a sufficient number of threads, you can avoid deadlock situations like
this.


Configure enough HTTP worker threads to avoid distributed deadlock.

Prefer Local Shards

Solr allows you to pass an optional boolean parameter named preferLocalShards to indicate that a
distributed query should prefer local replicas of a shard when available. In other words, if a query includes
preferLocalShards=true, then the query controller will look for local replicas to service the query instead of
selecting replicas at random from across the cluster. This is useful when a query requests many fields or large
fields to be returned per document because it avoids moving large amounts of data over the network when it is
available locally. In addition, this feature can be useful for minimizing the impact of a problematic replica with
degraded performance, as it reduces the likelihood that the degraded replica will be hit by other healthy replicas.
Lastly, it follows that the value of this feature diminishes as the number of shards in a collection increases
because the query controller will have to direct the query to non-local replicas for most of the shards. In other
words, this feature is mostly useful for optimizing queries directed towards collections with a small number of
shards and many replicas. Also, this option should only be used if you are load balancing requests across all
nodes that host replicas for the collection you are querying, as Solr's CloudSolrClient will do. If not
load-balancing, this feature can introduce a hotspot in the cluster since queries won't be evenly distributed
across the cluster.


Preferring local replicas can be more efficient in some cases, particularly for collections with few shards and many replicas.
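As a usage sketch (host and collection name are illustrative), the parameter is simply appended to a normal query:

http://localhost:8983/solr/gettingstarted/select?q=*:*&preferLocalShards=true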

Read and Write Side Fault Tolerance

Read-side and write-side fault tolerance.

SolrCloud supports elasticity, high availability, and fault tolerance in reads and writes. What this means, basically, is that when you have a large cluster, you can always make requests to the cluster: Reads will return results whenever possible, even if some nodes are down, and Writes will be acknowledged only if they are durable; i.e., you won't lose data.


Read Side Fault Tolerance

In a SolrCloud cluster each individual node load balances read requests across all the replicas in a collection. You
still need a load balancer on the 'outside' that talks to the cluster, or you need a smart client which understands
how to read and interact with Solr's metadata in ZooKeeper and only requires the ZooKeeper ensemble's
address to start discovering to which nodes it should send requests. (Solr provides a smart Java SolrJ client
called CloudSolrClient.)
Even if some nodes in the cluster are offline or unreachable, a Solr node will be able to correctly respond to a
search request as long as it can communicate with at least one replica of every shard, or one replica of every
relevant shard if the user limited the search via the 'shards' or '_route_' parameters. The more replicas there are
of every shard, the more likely that the Solr cluster will be able to handle search results in the event of node
failures.


zkConnected

The zkConnected flag tells you whether the node that served the request was connected to ZooKeeper.

A Solr node will return the results of a search request as long as it can communicate with at least one replica of
every shard that it knows about, even if it can not communicate with ZooKeeper at the time it receives the
request. This is normally the preferred behavior from a fault tolerance standpoint, but may result in stale or
incorrect results if there have been major changes to the collection structure that the node has not been informed
of via ZooKeeper (ie: shards may have been added or removed, or split into sub-shards)
A zkConnected header is included in every search response indicating if the node that processed the request
was connected with ZooKeeper at the time:
{
  "responseHeader": {
    "status": 0,
    "zkConnected": true,
    "QTime": 20,
    "params": {
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 107,
    "start": 0,
    "docs": [ ... ]
  }
}



shards.tolerant

In the event that one or more shards queried are completely unavailable, then Solr's default behavior is to fail the
request. However, there are many use-cases where partial results are acceptable and so Solr provides a
boolean shards.tolerant parameter (default 'false'). If shards.tolerant=true then partial results may
be returned. If the returned response does not contain results from all the appropriate shards then the response
header contains a special flag called 'partialResults'. The client can specify 'shards.info' along with the
'shards.tolerant' parameter to retrieve more fine-grained details.
Example response with partialResults flag set to 'true':
{
  "responseHeader": {
    "status": 0,
    "zkConnected": true,
    "partialResults": true,
    "QTime": 20,
    "params": {
      "q": "*:*"
    }
  },
  "response": {
    "numFound": 77,
    "start": 0,
    "docs": [ ... ]
  }
}
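A request enabling this behavior might look like the following (host and collection name are placeholders):

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards.tolerant=true&shards.info=true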


Write Side Fault Tolerance

Write-side fault tolerance.

SolrCloud is designed to replicate documents to ensure redundancy for your data, and enable you to send
update requests to any node in the cluster. That node will determine if it hosts the leader for the appropriate
shard, and if not it will forward the request to the leader, which will then forward it to all existing replicas,
using versioning to make sure every replica has the most up-to-date version. If the leader goes down, another
replica can take its place. This architecture enables you to be certain that your data can be recovered in the
event of a disaster, even if you are using Near Real Time Searching.


Describes the update flow and how disaster recovery works.

Recovery

A Transaction Log is created for each node so that every change to content or organization is noted. The log is
used to determine which content in the node should be included in a replica. When a new replica is created, it
refers to the Leader and the Transaction Log to know which content to include. If it fails, it retries.


A replica consults the leader and the Transaction Log to know which content it should hold.

Since the Transaction Log consists of a record of updates, it allows for more robust indexing because it includes
redoing the uncommitted updates if indexing is interrupted.
If a leader goes down, it may have sent requests to some replicas and not others. So when a new potential
leader is identified, it runs a synch process against the other replicas. If this is successful, everything should be
consistent, the leader registers as active, and normal actions proceed. If a replica is too far out of sync, the
system asks for a full replication/replay-based recovery.
If an update fails because cores are reloading schemas and some have finished but others have not, the leader
tells the nodes that the update failed and starts the recovery procedure. 


Achieved Replication Factor

When using a replication factor greater than one, an update request may succeed on the shard leader but fail on
one or more of the replicas. For instance, consider a collection with one shard and replication factor of three. In
this case, you have a shard leader and two additional replicas. If an update request succeeds on the leader but
fails on both replicas, for whatever reason, the update request is still considered successful from the perspective
of the client. The replicas that missed the update will sync with the leader when they recover.
Behind the scenes, this means that Solr has accepted updates that are only on one of the nodes (the current
leader). Solr supports the optional min_rf parameter on update requests, which causes the server to return the
achieved replication factor for an update request in the response. For the example scenario described above, if
the client application included min_rf >= 1, then Solr would return rf=1 in the Solr response header because the
request only succeeded on the leader. The update request will still be accepted as the min_rf parameter only
tells Solr that the client application wishes to know what the achieved replication factor was for the update
request. In other words, min_rf does not mean Solr will enforce a minimum replication factor as Solr does not
support rolling back updates that succeed on a subset of replicas. 
On the client side, if the achieved replication factor is less than the acceptable level, then the client application
can take additional measures to handle the degraded state. For instance, a client application may want to keep a
log of which update requests were sent while the state of the collection was degraded and then resend the
updates once the problem has been resolved. In short, min_rf is an optional mechanism for a client application
to be warned that an update request was accepted while the collection is in a degraded state.



When an update succeeds on the leader but fails on the replicas, SolrCloud still considers the update successful; the replicas will later recover the missed data from the leader. You can, however, pass the min_rf parameter to learn how many replicas actually received the update and then react on the client side accordingly.
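A minimal sketch of passing min_rf on an update request (the host, collection, and document are placeholders); the achieved replication factor is reported back as rf in the response header:

curl "http://localhost:8983/solr/gettingstarted/update?min_rf=2" \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc1"}]'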





SolrCloud Configuration and Parameters
Cluster configuration and parameters.

In this section, we'll cover the various configuration options for SolrCloud.
The following sections cover these topics:
Setting Up an External ZooKeeper Ensemble
Using ZooKeeper to Manage Configuration Files

ZooKeeper Access Control
Collections API
Parameter Reference
Command Line Utilities
SolrCloud with Legacy Configuration Files
ConfigSets API


Setting Up an External ZooKeeper Ensemble
Using an external ZooKeeper ensemble.

Although Solr comes bundled with Apache ZooKeeper, you should consider yourself discouraged from using this
internal ZooKeeper in production, because shutting down a redundant Solr instance will also shut down its
ZooKeeper server, which might not be quite so redundant. Because a ZooKeeper ensemble must have a quorum
of more than half its servers running at any given time, this can be a problem.
The solution to this problem is to set up an external ZooKeeper ensemble. Fortunately, while this process can
seem intimidating due to the number of powerful options, setting up a simple ensemble is actually quite
straightforward, as described below.


Why you should use an external ZooKeeper ensemble, and a note that setting one up is straightforward.

How Many ZooKeepers?

ZooKeeper deployments are usually made up of an odd number of machines.

When planning how many ZooKeeper nodes to configure, keep in mind that the main principle for a ZooKeeper
ensemble is maintaining a majority of servers to serve requests. This majority is also called a quorum. It is
generally recommended to have an odd number of ZooKeeper servers in your ensemble, so a majority is
maintained. For example, if you only have two ZooKeeper nodes and one goes down, 50% of available servers
is not a majority, so ZooKeeper will no longer serve requests. However, if you have three ZooKeeper nodes and
one goes down, 66% of your servers are still available, and ZooKeeper will continue normally while you
repair the one down node. If you have 5 nodes, you could continue operating with two down nodes if necessary.
More information on ZooKeeper clusters is available from the ZooKeeper documentation at
http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#sc_zkMulitServerSetup.


Download Apache ZooKeeper
Downloading ZooKeeper.
The first step in setting up Apache ZooKeeper is, of course, to download the software. It's available from http://zookeeper.apache.org/releases.html.

Solr currently uses Apache ZooKeeper v3.4.6.

Steps to set up ZooKeeper.
Setting Up a Single ZooKeeper

Create the instance

Creating the instance is a simple matter of extracting the files into a specific target directory. The actual directory
itself doesn't matter, as long as you know where it is, and where you'd like to have ZooKeeper store its internal
data.

Configure the instance
The next step is to configure your ZooKeeper instance. To do that, create the following file:
<ZOOKEEPER_HOME>/conf/zoo.cfg. To this file, add the following information:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181

The parameters are as follows:
tickTime: Part of what ZooKeeper does is to determine which servers are up and running at any given time, and
the minimum session timeout is defined as two "ticks". The tickTime parameter specifies, in milliseconds, how
long each tick should be.
dataDir: This is the directory in which ZooKeeper will store data about the cluster. This directory should start out
empty.
clientPort: This is the port on which Solr will access ZooKeeper.
Once this file is in place, you're ready to start the ZooKeeper instance.


The relevant configuration settings.

Run the instance

To run the instance, you can simply use the ZOOKEEPER_HOME/bin/zkServer.sh script provided, as with this
command: zkServer.sh start
Again, ZooKeeper provides a great deal of power through additional configurations, but delving into them is
beyond the scope of this tutorial. For more information, see the ZooKeeper Getting Started page. For this
example, however, the defaults are fine.


Point Solr at the instance

Pointing Solr at the ZooKeeper instance you've created is a simple matter of using the -z parameter when using
the bin/solr script. For example, in order to point the Solr instance to the ZooKeeper you've started on port 2181,
this is what you'd need to do:
Starting cloud example with Zookeeper already running at port 2181 (with all other defaults):
bin/solr start -e cloud -z localhost:2181 -noprompt
Add a node pointing to an existing ZooKeeper at port 2181:
bin/solr start -cloud -s <path to solr home for new node> -p 8987 -z localhost:2181
NOTE: When you are not using an example to start Solr, make sure you upload the configuration set to
ZooKeeper before creating the collection.


Start Solr in cloud mode against this ZooKeeper instance.

Shut down ZooKeeper

To shut down ZooKeeper, use the zkServer script with the "stop" command: zkServer.sh stop

Setting up a ZooKeeper Ensemble
Setting up a multi-node ZooKeeper ensemble.

With an external ZooKeeper ensemble, you need to set things up just a little more carefully as compared to the
Getting Started example.
The difference is that rather than simply starting up the servers, you need to configure them to know about and
talk to each other first. So your original zoo.cfg file might look like this:
dataDir=/var/lib/zookeeperdata/1
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Here you see three new parameters:
initLimit: Amount of time, in ticks, to allow followers to connect and sync to a leader. In this case, you have 5
ticks, each of which is 2000 milliseconds long, so the server will wait as long as 10 seconds to connect and sync
with the leader.
syncLimit: Amount of time, in ticks, to allow followers to sync with ZooKeeper. If followers fall too far behind a
leader, they will be dropped.
server.X: These are the IDs and locations of all servers in the ensemble, and the ports on which they communicate
with each other. The server ID must additionally be stored in the <dataDir>/myid file, located in the
dataDir of each ZooKeeper instance. The ID identifies each server, so in the case of this first instance, you would
create the file /var/lib/zookeeperdata/1/myid with the content "1".
Now, whereas with Solr you need to create entirely new directories to run multiple instances, all you need for a
new ZooKeeper instance, even if it's on the same machine for testing purposes, is a new configuration file. To
complete the example you'll create two more configuration files.
The <ZOOKEEPER_HOME>/conf/zoo2.cfg file should have the content:
tickTime=2000
dataDir=c:/sw/zookeeperdata/2
clientPort=2182
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
You'll also need to create <ZOOKEEPER_HOME>/conf/zoo3.cfg:
tickTime=2000
dataDir=c:/sw/zookeeperdata/3
clientPort=2183
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890

Finally, create your myid files in each of the dataDir directories so that each server knows which instance it is.
The id in the myid file on each machine must match the "server.X" definition. So, the ZooKeeper instance (or
machine) named "server.1" in the above example, must have a myid file containing the value "1". The myid file
can be any integer between 1 and 255, and must match the server IDs assigned in the zoo.cfg file.
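For example, on a Unix system, and assuming all three dataDir values were /var/lib/zookeeperdata/1, /var/lib/zookeeperdata/2 and /var/lib/zookeeperdata/3, the myid files could be created like this:

echo 1 > /var/lib/zookeeperdata/1/myid
echo 2 > /var/lib/zookeeperdata/2/myid
echo 3 > /var/lib/zookeeperdata/3/myid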
To start the servers, you can simply explicitly reference the configuration files:
cd <ZOOKEEPER_HOME>
bin/zkServer.sh start zoo.cfg
bin/zkServer.sh start zoo2.cfg
bin/zkServer.sh start zoo3.cfg
Once these servers are running, you can reference them from Solr just as you did before:
bin/solr start -e cloud -z localhost:2181,localhost:2182,localhost:2183 -noprompt
For more information on getting the most power from your ZooKeeper installation, check out the ZooKeeper
Administrator's Guide.


Configure and start the ZooKeeper ensemble, then start Solr against it.

Securing the ZooKeeper connection
You may also want to secure the communication between ZooKeeper and Solr.
To set up ACL protection of znodes, see ZooKeeper Access Control.

Using ZooKeeper to Manage Configuration Files

With SolrCloud your configuration files are kept in ZooKeeper. These files are uploaded in any of the following cases:
When you start a SolrCloud example using the bin/solr script.
When you create a collection using the bin/solr script.
When you explicitly upload a configuration set to ZooKeeper.


Startup Bootstrap
When you try SolrCloud for the first time using bin/solr -e cloud, the related configset gets uploaded to
ZooKeeper automatically and is linked with the newly created collection.
The below command would start SolrCloud with the default collection name (gettingstarted) and default configset
(data_driven_schema_configs) uploaded and linked to it.
$ bin/solr -e cloud -noprompt
You can also explicitly upload a configuration directory when creating a collection using the bin/solr
script with the -d option, such as:
$ bin/solr create -c mycollection -d data_driven_schema_configs
The create command will upload a copy of the data_driven_schema_configs configuration directory to
ZooKeeper under /configs/mycollection. Refer to the Solr Start Script Reference page for more details
about the create command for creating collections.

Once a configuration directory has been uploaded to ZooKeeper, you can update them using the ZooKeeper
Command Line Interface (zkCLI).

Configuration files are uploaded automatically when a collection is created.


Uploading configs using zkcli or SolrJ
Uploading configuration files with zkcli or the SolrJ client.

In production situations, Config Sets can also be uploaded to ZooKeeper independent of collection creation using
either Solr's zkcli.sh script, or the CloudSolrClient.uploadConfig java method.
The below command can be used to upload a new configset using the zkcli script.

$ sh zkcli.sh -cmd upconfig -zkhost <host:port> -confname <name for configset> \
  -solrhome <solrhome> -confdir <path to directory with configset>

More information about the ZooKeeper Command Line Utility to help manage changes to configuration files, can
be found in the section on Command Line Utilities.


Managing Your SolrCloud Configuration Files

To update or change your SolrCloud configuration files:
Download the latest configuration files from ZooKeeper, using the source control checkout process.
Make your changes.
Commit your changed file to source control.
Push the changes back to ZooKeeper.
Reload the collection so that the changes will be in effect.


The workflow for managing configuration files. (What does "source control" refer to here?)

Preparing ZooKeeper before first cluster start
If you will share the same ZooKeeper instance with other applications you should use a chroot in ZooKeeper.
Please see Taking Solr to Production#ZooKeeperchroot for instructions.
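A minimal sketch of the chroot setup (the chroot name /solr is only an example): create the path once with zkcli, then include it in every ZooKeeper connection string you pass to Solr:

./server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd makepath /solr
bin/solr start -cloud -z localhost:2181/solr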
There are certain configuration files containing cluster-wide configuration. Since some of these are crucial for the
cluster to function properly, you may need to upload such files to ZooKeeper before starting your Solr cluster for
the first time. Examples of such configuration files (not exhaustive) are solr.xml, security.json and
clusterprops.json.
If, for example, you would like to keep your solr.xml in ZooKeeper to avoid having to copy it to every node's
solr_home directory, you can push it to ZooKeeper with the zkcli.sh utility (Unix example):
zkcli.sh -zkhost localhost:2181 -cmd putfile /solr.xml /path/to/solr.xml


Before the Solr cluster starts for the first time, upload the required configuration files (e.g. solr.xml, security.json, cluster properties) to ZooKeeper.


ZooKeeper Access Control
This section describes using ZooKeeper access control lists (ACLs) with Solr. For information about ZooKeeper
ACLs, see the ZooKeeper documentation at
http://zookeeper.apache.org/doc/r3.4.6/zookeeperProgrammers.html#sc_ZooKeeperAccessControl.
About ZooKeeper ACLs
How to Enable ACLs

Changing ACL Schemes
Example Usages

ZooKeeper access control.

About ZooKeeper ACLs
SolrCloud uses ZooKeeper for shared information and for coordination.
This section describes how to configure Solr to add more restrictive ACLs to the ZooKeeper content it creates,
and how to tell Solr about the credentials required to access the content in ZooKeeper. If you want to use ACLs
in your ZooKeeper nodes, you will have to activate this functionality; by default, Solr behavior is open-unsafe
ACL everywhere and uses no credentials.
Changing Solr-related content in ZooKeeper might damage a SolrCloud cluster. For example:
Changing configuration might cause Solr to fail or behave in an unintended way.
Changing cluster state information into something wrong or inconsistent might very well make a SolrCloud
cluster behave strangely.
Adding a delete-collection job to be carried out by the Overseer will cause data to be deleted from the
cluster.
You may want to enable ZooKeeper ACLs with Solr if you grant access to your ZooKeeper ensemble to entities
you do not trust, or if you want to reduce risk of bad actions resulting from, e.g.:
Malware that found its way into your system.
Other systems using the same ZooKeeper ensemble (a "bad thing" might be done by accident).
You might even want to limit read access, if you think there is content in ZooKeeper that not everyone should know
about. Or you might just generally work on a need-to-know basis.
Protecting ZooKeeper itself could mean many different things. This section is about protecting Solr content
in ZooKeeper. ZooKeeper content basically lives persisted on disk and (partly) in memory of the ZooKeeper
processes. This section is not about protecting ZooKeeper data at storage or ZooKeeper process levels -
that's for ZooKeeper to deal with.
But this content is also available to "the outside" via the ZooKeeper API. Outside processes can connect to
ZooKeeper and create/update/delete/read content; for example, a Solr node in a SolrCloud cluster wants to
create/update/delete/read, and a SolrJ client wants to read from the cluster. It is the responsibility of the outside
processes that create/update content to setup ACLs on the content. ACLs describe who is allowed to read,
update, delete, create, etc. Each piece of information (znode/content) in ZooKeeper has its own set of ACLs, and
inheritance or sharing is not possible. The default behavior in Solr is to add one ACL on all the content it creates
- one ACL that gives anyone the permission to do anything (in ZooKeeper terms this is called "the open-unsafe
ACL").



Why to use ACLs and what they cover.

How to Enable ACLs

We want to be able to:
Control the credentials Solr uses for its ZooKeeper connections. The credentials are used to get
permission to perform operations in ZooKeeper.
Control which ACLs Solr will add to znodes (ZooKeeper files/folders) it creates in ZooKeeper.
Control it "from the outside", so that you do not have to modify and/or recompile Solr code to turn this on.
Solr nodes, clients and tools (e.g. ZkCLI) always use a java class called SolrZkClient to deal with their
ZooKeeper stuff. The implementation of the solution described here is all about changing SolrZkClient. If you
use SolrZkClient in your application, the descriptions below will be true for your application too.


Every way Solr interacts with ZooKeeper goes through SolrZkClient, which is the key point here.


Controlling Credentials
You control which credentials provider will be used by configuring the zkCredentialsProvider property in solr.xml's <solrcloud> section to the name of a class (on the classpath) implementing the following interface:
package org.apache.solr.common.cloud;

public interface ZkCredentialsProvider {
  public class ZkCredentials {
    String scheme;
    byte[] auth;

    public ZkCredentials(String scheme, byte[] auth) {
      super();
      this.scheme = scheme;
      this.auth = auth;
    }

    String getScheme() {
      return scheme;
    }

    byte[] getAuth() {
      return auth;
    }
  }

  Collection<ZkCredentials> getCredentials();
}

Solr determines which credentials to use by calling the getCredentials() method of the given credentials provider. If no provider has been configured, the default implementation, DefaultZkCredentialsProvider, is used.

How Solr obtains its ZooKeeper credentials.

Out of the Box Implementations

You can always make your own implementation, but Solr comes with two implementations:
org.apache.solr.common.cloud.DefaultZkCredentialsProvider: Its getCredentials()
returns a list of length zero, or "no credentials used". This is the default and is used if you do not configure
a provider in solr.xml.
org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider:
This lets you define your credentials using system properties. It supports at most one set of credentials.
The scheme is "digest". The username and password are defined by the system properties
"zkDigestUsername" and "zkDigestPassword", respectively. This set of credentials will be added
to the list of credentials returned by getCredentials() if both username and password are provided.
If the one set of credentials above is not added to the list, this implementation will fall back to
default behavior and use the (empty) credentials list from DefaultZkCredentialsProvider.


Solr ships with two implementations: the default one that supplies no credentials, and one that can add a single set of credentials via system properties.

Controlling ACLs

You control which ACLs will be added by configuring zkACLProvider property in solr.xml 's  <solrcloud> section to the name of a class (on the classpath) implementing the following interface:

package org.apache.solr.common.cloud;

public interface ZkACLProvider {
  List<ACL> getACLsToAdd(String zNodePath);
}

When Solr wants to create a new znode, it determines which ACLs to put on the znode by calling the
getACLsToAdd() method of the given ACL provider. If no provider has been configured, the default implementation,
DefaultZkACLProvider, is used.


When Solr creates a znode, it calls the ACL provider to decide who gets which permissions on that node.

Out of the Box Implementations
You can always make your own implementation, but Solr comes with:
org.apache.solr.common.cloud.DefaultZkACLProvider: It returns a list of length one for all
zNodePath-s. The single ACL entry in the list is "open-unsafe". This is the default and is used if you do not
configure a provider in solr.xml.
org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider: This lets
you define your ACLs using system properties. Its getACLsToAdd() implementation does not use
zNodePath for anything, so all znodes will get the same set of ACLs. It supports adding one or both of these
options:
A user that is allowed to do everything.
The permission is "ALL" (corresponding to all of CREATE, READ, WRITE, DELETE, and ADMIN),
and the scheme is "digest".
The username and password are defined by the system properties "zkDigestUsername" and
"zkDigestPassword", respectively.
This ACL will not be added to the list of ACLs unless both username and password are provided.
A user that is only allowed to perform read operations.
The permission is "READ" and the scheme is "digest".
The username and password are defined by the system properties "zkDigestReadonlyUsername"
and "zkDigestReadonlyPassword", respectively.
This ACL will not be added to the list of ACLs unless both username and password are provided.
If neither of the above ACLs is added to the list, the (empty) ACL list of DefaultZkACLProvider will
be used by default.
Notice the overlap in system property names with the credentials provider
VMParamsSingleSetCredentialsDigestZkCredentialsProvider (described above). This is to let the two providers
collaborate in a nice and perhaps common way: we always protect access to content by limiting it to two users - an
admin-user and a readonly-user - AND we always connect with credentials corresponding to this same admin-user,
basically so that we can do anything to the content/znodes we create ourselves.
You can give the readonly credentials to "clients" of your SolrCloud cluster - e.g. to be used by SolrJ clients.
They will be able to read whatever is necessary to run a functioning SolrJ client, but they will not be able to
modify any content in ZooKeeper.



The ACL providers -- two out-of-the-box implementations.


Changing ACL Schemes
Over the lifetime of operating your Solr cluster, you may decide to move from an unsecured ZooKeeper to a
secured instance. Changing the configured zkACLProvider in solr.xml will ensure that newly created nodes
are secure, but will not protect the already existing data. To modify all existing ACLs, you can use: ZkCLI -cmd
updateacls /zk-path.
Changing ACLs in ZK should only be done while your SolrCloud cluster is stopped. Attempting to do so while
Solr is running may result in inconsistent state and some nodes becoming inaccessible. To configure the new
ACLs, run ZkCli with the following VM properties: -DzkACLProvider=...
-DzkCredentialsProvider=....
The credentials provider must be one that has current admin privileges on the nodes. When omitted, the
process will use no credentials (suitable for an unsecured configuration).
The ACL provider will be used to compute the new ACLs. When omitted, the process will set all permissions for all users, removing any security present.
You may use the VMParamsSingleSetCredentialsDigestZkCredentialsProvider and
VMParamsAllAndReadonlyDigestZkACLProvider implementations as described earlier on this page for these properties.
After changing the ZK ACLs, make sure that the contents of your solr.xml match, as described for the initial setup.


Applying ACLs to already-existing znodes: reconfigure, stop the cluster, and run the ZkCLI updateacls command.

Example Usages

A worked example:


Let's say that you want all Solr-related content in ZooKeeper protected. You want an "admin" user that is able to
do anything to the content in ZooKeeper - this user will be used for initializing Solr content in ZooKeeper and for
server-side Solr nodes. You also want a "readonly" user that is only able to read content from ZooKeeper - this
user will be handed over to "clients".
In the examples below:
The "admin" user's username/password is admin-user/admin-password.
The "readonly" user's username/password is readonly-user/readonly-password.
The provider class names must first be configured in solr.xml:
...
<solrcloud>
  ...
  <str name="zkCredentialsProvider">org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider</str>
  <str name="zkACLProvider">org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider</str>

To use ZkCLI:

SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password \
  -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password"
java ... $SOLR_ZK_CREDS_AND_ACLS ... org.apache.solr.cloud.ZkCLI -cmd ...

For operations using bin/solr, add the following at the bottom of bin/solr.in.sh:

SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password \
  -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password"
SOLR_OPTS="$SOLR_OPTS $SOLR_ZK_CREDS_AND_ACLS"

For operations using bin\solr.cmd, add the following at the bottom of bin\solr.in.cmd:

set SOLR_ZK_CREDS_AND_ACLS=-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password ^
  -DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password
set SOLR_OPTS=%SOLR_OPTS% %SOLR_ZK_CREDS_AND_ACLS%

To start your own "clients" (using SolrJ):

SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=readonly-user -DzkDigestPassword=readonly-password"
java ... $SOLR_ZK_CREDS_AND_ACLS ...

Or since you yourself are writing the code creating the SolrZkClient-s, you might want to override the
provider implementations at the code level instead.


Collections API
The Collections API is used to enable you to create, remove, or reload collections; in the context of SolrCloud you can also use it to create collections with a specific number of shards and replicas.
Purpose.

API Entry Points
The base URL for all API calls below is http://<hostname>:<port>/solr.
/admin/collections?action=CREATE: create a collection
/admin/collections?action=MODIFYCOLLECTION: Modify certain attributes of a collection
/admin/collections?action=RELOAD: reload a collection
/admin/collections?action=SPLITSHARD: split a shard into two new shards
/admin/collections?action=CREATESHARD: create a new shard
/admin/collections?action=DELETESHARD: delete an inactive shard
/admin/collections?action=CREATEALIAS: create or modify an alias for a collection
/admin/collections?action=DELETEALIAS: delete an alias for a collection
/admin/collections?action=DELETE: delete a collection
/admin/collections?action=DELETEREPLICA: delete a replica of a shard
/admin/collections?action=ADDREPLICA: add a replica of a shard
/admin/collections?action=CLUSTERPROP: Add/edit/delete a cluster-wide property
/admin/collections?action=MIGRATE: Migrate documents to another collection
/admin/collections?action=ADDROLE: Add a specific role to a node in the cluster
/admin/collections?action=REMOVEROLE: Remove an assigned role
/admin/collections?action=OVERSEERSTATUS: Get status and statistics of the overseer
/admin/collections?action=CLUSTERSTATUS: Get cluster status
/admin/collections?action=REQUESTSTATUS: Get the status of a previous asynchronous request
/admin/collections?action=DELETESTATUS: Delete the stored response of a previous asynchronous
request
/admin/collections?action=LIST: List all collections
/admin/collections?action=ADDREPLICAPROP: Add an arbitrary property to a replica specified by
collection/shard/replica
/admin/collections?action=DELETEREPLICAPROP: Delete an arbitrary property from a replica specified 
by collection/shard/replica
/admin/collections?action=BALANCESHARDUNIQUE: Distribute an arbitrary property, one per shard,
across the nodes in a collection
/admin/collections?action=REBALANCELEADERS: Distribute leader role based on the "preferredLeader"
assignments
/admin/collections?action=FORCELEADER: Force a leader election in a shard if leader is lost
/admin/collections?action=MIGRATESTATEFORMAT: Migrate a collection from shared clusterstate.json to per-collection state.json


That is the full list of API actions; detailed parameter descriptions and examples follow in the reference guide.
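For instance, a collection could be created with a call like this (the host, collection name, and shard/replica counts are placeholders):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2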


Parameter Reference

Cluster Parameters

numShards

SolrCloud Instance Parameters

These are set in solr.xml, but by default the host and hostContext parameters are set up to also work with system properties.

SolrCloud Instance ZooKeeper Parameters


Command Line Utilities

The difference between Solr's zkcli.sh and ZooKeeper's own zkCli.sh, and where each lives.

Using Solr's ZooKeeper CLI

-cmd <arg>
CLI command to be executed: bootstrap, upconfig, downconfig, linkconfig,
makepath, get, getfile, put, putfile, list, clear or clusterprop.
This parameter is mandatory.

There are quite a few commands.

ZooKeeper CLI Examples

Upload a configuration directory
Upload a config directory to ZooKeeper.

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
  -cmd upconfig -confname my_new_config -confdir server/solr/configsets/basic_configs/conf


Bootstrap ZooKeeper from existing SOLR_HOME
Not sure what this one is used for.

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 \
  -cmd bootstrap -solrhome /var/solr/data

Put arbitrary data into a new ZooKeeper file
Create a znode in ZooKeeper and put some data into it.

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
  -cmd put /my_zk_file.txt 'some data'

Put a local file into a new ZooKeeper file

Store the contents of a local file as a ZooKeeper file.

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
  -cmd putfile /my_zk_file.txt /tmp/my_local_file.txt

Link a collection to a configuration set

Map a collection to a named configuration set.

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
  -cmd linkconfig -collection gettingstarted -confname my_new_config

Create a new ZooKeeper path
Create a new ZooKeeper znode/path.
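No command is shown here in the notes; a makepath call presumably looks like the following (the path /solr is only an example):

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 -cmd makepath /solr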

Set a cluster property

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 \
  -cmd clusterprop -name urlScheme -val https


---->> Not every command is covered here; there are many more.

SolrCloud with Legacy Configuration Files


ConfigSets API
Only applicable in SolrCloud mode.

API Entry Points
The base URL for all API calls is http://<hostname>:<port>/solr.
/admin/configs?action=CREATE: create a ConfigSet, based on an existing ConfigSet
/admin/configs?action=DELETE: delete a ConfigSet
/admin/configs?action=LIST: list all ConfigSets
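As a sketch (host and ConfigSet names are placeholders), a new ConfigSet could be created from an existing one and the result then listed:

http://localhost:8983/solr/admin/configs?action=CREATE&name=myConfigSet&baseConfigSet=predefinedTemplate
http://localhost:8983/solr/admin/configs?action=LIST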





Rule-based Replica Placement

When Solr needs to assign nodes to collections, it can either automatically assign them randomly or the user can
specify a set of nodes where it should create the replicas. With very large clusters, it is hard to specify exact node
names and it still does not give you fine-grained control over how nodes are chosen for a shard. The user should
be in complete control of where the nodes are allocated for each collection, shard and replica. This helps to
optimally allocate hardware resources across the cluster.
Rule-based replica assignment allows the creation of rules to determine the placement of replicas in the cluster.
In the future, this feature will help to automatically add or remove replicas when systems go down, or when
higher throughput is required. This enables a more hands-off approach to administration of the cluster.
This feature is used in the following instances:

Collection creation
Shard creation
Replica creation
Shard splitting


Automates resource allocation for collections, shards, and replicas.

Common Use Cases

There are several situations where this functionality may be used. A few of the rules that could be implemented
are listed below:
Don’t assign more than 1 replica of this collection to a host.
Assign all replicas to nodes with more than 100GB of free disk space or, assign replicas where there is
more disk space.
Do not assign any replica on a given host because I want to run an overseer there.
Assign only one replica of a shard in a rack.
Assign replica in nodes hosting less than 5 cores.
Assign replicas in nodes hosting the least number of cores


Rule Conditions

A rule is a set of conditions that a node must satisfy before a replica core can be created there.

The conditions a node must satisfy.

Rule Conditions
There are three possible conditions.
shard: this is the name of a shard or a wild card (* means for all shards). If shard is not specified, then the
rule applies to the entire collection.
replica: this can be a number or a wild-card (* means any number zero to infinity).
tag: this is an attribute of a node in the cluster that can be used in a rule, e.g. “freedisk”, “cores”, “rack”,
“dc”, etc. The tag name can be a custom string. If creating a custom tag, a snitch is responsible for
providing tags and values. The section Snitches below describes how to add a custom tag, and defines
six pre-defined tags (cores, freedisk, host, port, node, and sysprop).


The six pre-defined tags.
Rule Operators

A condition can have one of the following operators to set the parameters for the rule.
equals (no operator required): tag:x means tag value must be equal to ‘x’
greater than (>): tag:>x means tag value greater than ‘x’. x must be a number
less than (<): tag:<x means tag value less than ‘x’. x must be a number
not equal (!): tag:!x means tag value MUST NOT be equal to ‘x’. The equals check is performed on
String value


The rule operators.

Fuzzy Operator (~)

This can be used as a suffix to any condition. This would first try to satisfy the rule strictly. If Solr can’t find
enough nodes to match the criterion, it tries to find the next best match which may not satisfy the criterion. For
example, if we have a rule such as, freedisk:>200~, Solr will try to assign replicas of this collection on nodes
with more than 200GB of free disk space. If that is not possible, the node which has the most free disk space will
be chosen instead.


Choosing Among Equals
When several nodes satisfy a rule, the rules themselves are used to sort those nodes. This ensures that even if many nodes match the
rules, the best nodes are picked for node assignment. For example, if there is a rule such as freedisk:>20,
nodes are sorted first on disk space descending and the node with the most disk space is picked up first. Or, if
the rule is cores:<5, nodes are sorted with number of cores ascending and the node with the least number of
cores is picked up first.


Among equally matching nodes, the best one after sorting is chosen.

Rules for new shards

The rules are persisted along with the collection state. So, when a new replica is created, the system will assign
replicas satisfying the rules. When a new shard is created as a result of a CREATESHARD call, ensure that you have
created rules specific to that shard name. Rules can be altered using the MODIFYCOLLECTION command. However,
it is not required to do so if the rules do not specify explicit shard names. For example, a rule such as
shard:shard1,replica:*,ip_3:168 will not apply to any new shard created. But, if your rule is
replica:*,ip_3:168, then it will apply to any new shard created.
The same is applicable to shard splitting. Shard splitting is treated exactly the same way as shard creation. Even
though shard1_1 and shard1_2 may be created from shard1, the rules treat them as distinct, unrelated
shards.


How rules apply to newly created shards.


Snitches

Tag values come from a plugin called Snitch. If there is a tag named ‘rack’ in a rule, there must be Snitch which
provides the value for ‘rack’ for each node in the cluster. A snitch implements the Snitch interface. Solr, by
default, provides a default snitch which provides the following tags:
cores: Number of cores in the node
freedisk: Disk space available in the node
host: host name of the node
port: port of the node
node: node name
ip_1, ip_2, ip_3, ip_4: These are IP fragments for each node. For example, in a host with IP 192.168.1.2,
ip_1 = 2, ip_2 = 1, ip_3 = 168 and ip_4 = 192.
sysprop.{PROPERTY_NAME}: These are values available from system properties. sysprop.key means
a value that is passed to the node as -Dkey=keyValue during the node startup. It is possible to use
rules like sysprop.key:expectedVal,shard:*



The tags that can be used in rules.

How Snitches are Configured

It is possible to use one or more snitches for a set of rules. If the rules only need tags from the default snitch, it need
not be explicitly configured. For example:
snitch=class:fqn.ClassName,key1:val1,key2:val2,key3:val3
How Tag Values are Collected
Identify the set of tags specified in the rules.
Create instances of the Snitches specified. The default snitch is always created.
Ask each Snitch if it can provide values for any of the tags. If even one tag does not have a snitch, the
assignment fails.
After identifying the Snitches, they provide the tag values for each node in the cluster.
If the value for a tag is not obtained for a given node, it cannot participate in the assignment.


How snitches are configured and how tag values are collected.



Examples

Keep less than 2 replicas (at most 1 replica) of this collection on any node
For this rule, we define the replica condition with operators for "less than 2", and use a pre-defined tag named 
node to define nodes with any name.

replica:<2,node:*


Ensures no node in the cluster holds more than one replica of this collection.

For a given shard, keep less than 2 replicas on any node
For this rule, we use the shard condition to define any shard name, the replica condition with operators for
"less than 2", and finally a pre-defined tag named node to define nodes with any name.
shard:*,replica:<2,node:*


For every shard, no node may hold more than one of its replicas.


Assign all replicas in shard1 to rack 730

This rule limits the shard condition to 'shard1', but any number of replicas. We're also referencing a custom tag
named rack. Before defining this rule, we will need to configure a custom Snitch which provides values for the
tag rack.
shard:shard1,replica:*,rack:730
In this case, the default value of replica is * (or, all replicas). So, it can be omitted and the rule can be reduced
to:
shard:shard1,rack:730


This uses a custom snitch-provided tag, rack.


Create replicas in nodes with less than 5 cores only

This rule uses the replica condition to define any number of replicas, but adds a pre-defined tag named cores
and uses operators for "less than 5".
replica:*,cores:<5
Again, we can simplify this to use the default value for replica, like so:
cores:<5


Replicas may only be created on nodes hosting fewer than 5 cores.

Do not create any replicas in host 192.45.67.3

This rule uses only the pre-defined tag host to define an IP address where replicas should not be placed.
host:!192.45.67.3

Do not create replicas on the specified host.

Defining Rules

Rules are specified per collection during collection creation as request parameters. It is possible to specify
multiple ‘rule’ and ‘snitch’ params as in this example:
snitch=class:EC2Snitch&rule=shard:*,replica:1,dc:dc1&rule=shard:*,replica:<2,dc:dc3
These rules are persisted in clusterstate.json in Zookeeper and are available throughout the lifetime of the
collection. This enables the system to perform any future node allocation without direct user interaction. The
rules added during collection creation can be modified later using the MODIFYCOLLECTION API.


Specifying your own snitch along with the rules.
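Putting it together, the rule and snitch parameters are simply appended to the collection-creation call; the host, collection name, and rule values below are placeholders (special characters such as < may need URL encoding):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&rule=shard:*,replica:<2,node:*&snitch=class:EC2Snitch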





Cross Data Center Replication (CDCR)

The SolrCloud architecture is not particularly well suited for situations where a single SolrCloud cluster consists
of nodes in separated data clusters connected by an expensive pipe. The root problem is that SolrCloud is
designed to support Near Real Time Searching by immediately forwarding updates between nodes in the cluster
on a per-shard basis.
"CDCR" features exist to help mitigate the risk of an entire Data Center outage.
Used to mitigate the impact of a complete data center outage.

What is CDCR?
Glossary
Architecture
Major Components
CDCR Configuration
CDCR Initialization
Inter-Data Center Communication
Updates Tracking & Pushing
Synchronization of Update Checkpoints
Maintenance of Updates Log
Monitoring
CDC Replicator
Limitations
Configuration
Source Configuration
Target Configuration
Configuration Details
The Replica Element
The Replicator Element
The updateLogSynchronizer Element
The Buffer Element
CDCR API
API Entry Points (Control)
API Entry Points (Monitoring)
Control Commands
Monitoring commands
Initial Startup
Monitoring
ZooKeeper settings
Upgrading and Patching Production


What CDCR is, which components it has, and how to configure, use, and monitor it.




What is CDCR?

The goal of the project is to replicate data to multiple Data Centers. The initial version of the solution will cover
the active-passive scenario where data updates are replicated from a Source Data Center to a Target Data
Center. Data updates include adding/updating and deleting documents.


Initially designed to replicate data changes from a Source Data Center to a Target Data Center.

Data changes on the Source Data Center are replicated to the Target Data Center only after they are persisted
to disk. The data changes can be replicated in real-time (with a small delay) or could be scheduled to be sent in
intervals to the Target Data Center. This solution pre-supposes that the Source and Target data centers begin
with the same documents indexed. Of course the indexes may be empty to start.


Data changes are replicated to the Target Data Center in near real time, after being persisted to disk; at the start, both data centers should contain the same indexed documents.

Each shard leader in the Source Data Center will be responsible for replicating its updates to the appropriate
collection in the Target Data Center. When receiving updates from the Source Data Center, shard leaders in the
Target Data Center will replicate the changes to their own replicas.


Each shard leader in the Source Data Center pushes its updates to the corresponding shard leader in the Target Data Center, which then replicates them to its own replicas.

This replication model is designed to tolerate some degradation in connectivity, accommodate limited bandwidth, and support batch updates to optimize communication. 

Replication supports both a new empty index and pre-built indexes. In the scenario where the replication is set
up on a pre-built index, CDCR will ensure consistency of the replication of the updates, but cannot ensure
consistency on the full index. Therefore any index created before CDCR was set up will have to be replicated by
other means (described in the section Starting CDCR the first time with an existing index) in order that Source
and Target indexes be fully consistent.


Any index that existed before CDCR was set up must be replicated by other means to keep the Source and Target indexes fully consistent.

The active-passive nature of the initial implementation implies a "push" model from the Source collection to the
Target collection. Therefore, the Source configuration must be able to "see" the ZooKeeper ensemble in the
Target cluster. The Target ZooKeeper ensemble is configured in the Source's solrconfig.xml file.


A push model is used: data is pushed from the Source collection to the Target collection, so the Target ZooKeeper ensemble must be configured in the Source's solrconfig.xml.


CDCR is configured to replicate from collections in the Source cluster to collections in the Target cluster on a
collection-by-collection basis. Since CDCR is configured in solrconfig.xml (on both Source and Target
clusters), the settings can be tailored for the needs of each collection. 
CDCR can be configured to replicate from one collection to a second collection within the same cluster. That is a
specialized scenario not covered in this document.


Supports replication both across clusters and within the same cluster.

Glossary

Terms used in this document include:
Node: A JVM instance running Solr; a server.
Cluster: A set of Solr nodes managed as a single unit by a ZooKeeper ensemble, hosting one or more
Collections.
Data Center: A group of networked servers hosting a Solr cluster. In this document, the terms Cluster and
Data Center are interchangeable as we assume that each Solr cluster is hosted in a different group of
networked servers.
Shard: A sub-index of a single logical collection. This may be spread across multiple nodes of the cluster.
Each shard can have as many replicas as needed.
Leader: Each shard has one node identified as its leader. All the writes for documents belonging to a
shard are routed through the leader.
Replica: A copy of a shard for use in failover or load balancing. Replicas comprising a shard can either be
leaders or non-leaders.
Follower: A convenience term for a replica that is not the leader of a shard.
Collection: Multiple documents that make up one logical index. A cluster can have multiple collections.
Updates Log: An append-only log of write operations maintained by each node.



Architecture

Here is a picture of the data flow

Updates and deletes are first written to the Source cluster, then forwarded to the Target cluster. The data flow
sequence is:
A shard leader receives a new data update that is processed by its Update Processor.
The data update is first applied to the local index.
Upon successful application of the data update on the local index, the data update is added to the
Updates Log queue.
After the data update is persisted to disk, the data update is sent to the replicas within the Data Center.
After Step 4 is successful CDCR reads the data update from the Updates Log and pushes it to the
corresponding collection in the Target Data Center. This is necessary in order to ensure consistency
between the Source and Target Data Centers.
The leader on the Target data center writes the data locally and forwards it to all its followers.
Steps 1, 2, 3 and 4 are performed synchronously by SolrCloud; Step 5 is performed asynchronously by a
background thread. Given that CDCR replication is performed asynchronously, it becomes possible to push
batch updates in order to minimize network communication overhead. Also, if CDCR is unable to push the
update at a given time -- for example, due to a degradation in connectivity -- it can retry later without any impact
on the Source Data Center.
One implication of the architecture is that the leaders in the Source cluster must be able to "see" the leaders in
the Target cluster. Since leaders may change, this effectively means that all nodes in the Source cluster must be
able to "see" all Solr nodes in the Target cluster so firewalls, ACL rules, etc. must be configured with care.


A step-by-step description of the update data flow.


Major Components

There are a number of key features and components in CDCR’s architecture:

CDCR Configuration

In order to configure CDCR, the Source Data Center requires the host address of the ZooKeeper cluster
associated with the Target Data Center. The ZooKeeper host address is the only information needed by CDCR
to instantiate the communication with the Target Solr cluster. The CDCR configuration file on the Source cluster
will therefore contain a list of ZooKeeper hosts. The CDCR configuration file might also contain
secondary/optional configuration, such as the number of CDC Replicator threads, batch updates related settings,
etc.


The ZooKeeper information must be configured, plus some optional settings.


CDCR Initialization

CDCR supports incremental updates to either new or existing collections. CDCR may not be able to keep up
with very high volume updates, especially if there are significant communications latencies due to a slow "pipe"
between the data centers. Some scenarios:
There is an initial bulk load of a corpus followed by lower volume incremental updates. In this case, one
can do the initial bulk load, replicate the index and then keep them synchronized via CDCR. See the
section Starting CDCR the first time with an existing index for more information.
The index is being built up from scratch, without a significant initial bulk load. CDCR can be set up on
empty collections and keep them synchronized from the start.
The index is always being updated at a volume too high for CDCR to keep up. This is especially possible
in situations where the connection between the Source and Target data centers is poor. This scenario is
unsuitable for CDCR in its current form.


Notes on which scenarios CDCR is suited for.


Inter-Data Center Communication

Communication between Data Centers will be achieved through HTTP and the Solr REST API using the SolrJ
client. The SolrJ client will be instantiated with the ZooKeeper host of the Target Data Center. SolrJ will manage
the shard leader discovery process.


Cross-cluster communication happens over HTTP via the SolrJ client.

Updates Tracking & Pushing

CDCR replicates data updates from the Source to the Target Data Center by leveraging the Updates Log.

Replication is driven by the Updates Log.

A background thread regularly checks the Updates Log for new entries, and then forwards them to the Target
Data Center. The thread therefore needs to keep a checkpoint in the form of a pointer to the last update
successfully processed in the Updates Log. Upon acknowledgement from the Target Data Center that updates
have been successfully processed, the Updates Log pointer is updated to reflect the current checkpoint.


A background thread on the Source cluster periodically checks the Updates Log, starting from the last recorded checkpoint; once the Target cluster acknowledges the updates, the checkpoint is advanced for the next pass.

This pointer must be synchronized across all the replicas. In the case where the leader goes down and a new
leader is elected, the new leader will be able to resume replication from the last update by using this
synchronized pointer. The strategy to synchronize such a pointer across replicas will be explained next.


This checkpoint must be synchronized to all replicas in the Source cluster.

If for some reason, the Target Data Center is offline or fails to process the updates, the thread will periodically try
to contact the Target Data Center and push the updates.

If the Target cluster is down, the Source cluster will periodically retry pushing the updates.


Synchronization of Update Checkpoints

A reliable synchronization of the update checkpoints between the shard leader and shard replicas is critical to
avoid introducing inconsistency between the Source and Target Data Centers. Another important requirement is
that the synchronization must be performed with minimal network traffic to maximize scalability.


In order to achieve this, the strategy is to:

Uniquely identify each update operation. This unique identifier will serve as a pointer.
Rely on two stores: an ephemeral store on the Source shard leader, and a persistent store on the Target cluster.

A unique identifier plus storage on both sides.

The shard leader in the Source cluster will be in charge of generating a unique identifier for each update
operation, and will keep a copy of the identifier of the last processed updates in memory. The identifier will be
sent to the Target cluster as part of the update request. On the Target Data Center side, the shard leader will
receive the update request, store it along with the unique identifier in the Updates Log, and replicate it to the
other shards.


SolrCloud is already providing a unique identifier for each update operation, i.e., a “version” number. This version
number is generated using a time-based Lamport clock which is incremented for each update operation sent. This
provides a "happened-before" ordering of the update operations that will be leveraged in (1) the initialization of
the update checkpoint on the Source cluster, and in (2) the maintenance strategy of the Updates Log.


SolrCloud uses the time-based version field as this unique identifier.

The persistent storage on the Target cluster is used only during the election of a new shard leader on the Source
cluster. If a shard leader goes down on the Source cluster and a new leader is elected, the new leader will
contact the Target cluster to retrieve the last update checkpoint and instantiate its ephemeral pointer. On such a
request, the Target cluster will retrieve the latest identifier received across all the shards, and send it back to the
Source cluster. To retrieve the latest identifier, every shard leader will look up the identifier of the first entry in its
Update Logs and send it back to a coordinator. The coordinator will have to select the highest among them.


When a shard leader in the Source cluster goes down, the newly elected leader retrieves the identifiers from the Target cluster and uses the highest version among them.

This strategy does not require any additional network traffic and ensures reliable pointer synchronization.
Consistency is principally achieved by leveraging SolrCloud. The update workflow of SolrCloud ensures that
every update is applied to the leader but also to any of the replicas. If the leader goes down, a new leader is
elected. During the leader election, a synchronization is performed between the new leader and the other
replicas. As a result, this ensures that the new leader has Update Logs consistent with those of the previous leader.
Having a consistent Updates Log means that:
On the Source cluster, the update checkpoint can be reused by the new leader.
On the Target cluster, the update checkpoint will be consistent between the previous and new leader. This
ensures the correctness of the update checkpoint sent by a newly elected leader from the Target cluster.


Maintenance of Updates Log

The CDCR replication logic requires modification to the maintenance logic of the Updates Log on the Source
Data Center. Initially, the Updates Log acts as a fixed size queue, limited to 100 update entries. In the CDCR
scenario, the Update Logs must act as a queue of variable size as they need to keep track of all the updates up
through the last processed update by the Target Data Center. Entries in the Update Logs are removed only when
all pointers (one pointer per Target Data Center) are after them.
If the communication with one of the Target Data Centers is slow, the Updates Log on the Source Data Center
can grow to a substantial size. In such a scenario, it is necessary for the Updates Log to be able to efficiently find
a given update operation given its identifier. Given that its identifier is an incremental number, it is possible to
implement an efficient search strategy. Each transaction log file contains as part of its filename the version
number of the first element. This is used to quickly traverse all the transaction log files and find the transaction
log file containing one specific version number.



Maintenance of the update log files.


Monitoring

CDCR provides the following monitoring capabilities over the replication operations:
Monitoring of the outgoing and incoming replications, with information such as the Source and Target
nodes, their status, etc.
Statistics about the replication, with information such as operations (add/delete) per second, number of
documents in the queue, etc.
Information about the lifecycle and statistics will be provided on a per-shard basis by the CDC Replicator thread.
The CDCR API can then aggregate this information at a collection level.



The monitoring information CDCR provides.


CDC Replicator

The CDC Replicator is a background thread that is responsible for replicating updates from a Source Data
Center to one or more Target Data Centers. It will also be responsible in providing monitoring information on a
per-shard basis. As there can be a large number of collections and shards in a cluster, we will use a fixed-size
pool of CDC Replicator threads that will be shared across shards.


Limitations

The current design of CDCR has some limitations. CDCR will continue to evolve over time and many of these
limitations will be addressed. Among them are:
CDCR is unlikely to be satisfactory for bulk-load situations where the update rate is high, especially if the
bandwidth between the Source and Target clusters is restricted. In this scenario, the initial bulk load
should be performed, the Source and Target data centers synchronized and CDCR be utilized for
incremental updates.
CDCR is currently only active-passive; data is pushed from the Source cluster to the Target cluster. There
is active work being done in this area in the 6x code line to remove this limitation.


Some current limitations:

High-volume updates require sufficient bandwidth between the clusters.

Replication is currently push-only (active-passive), from the Source to the Target.


Configuration

The Source and Target configurations differ in the case of the data centers being in separate clusters. "Cluster"
here means separate ZooKeeper ensembles controlling disjoint Solr instances. Whether these data centers are
physically separated or not is immaterial for this discussion.


Separate clusters (separate ZooKeeper ensembles) are required for the cross-cluster case.

Source Configuration

An example of a Source cluster configuration:
Here is a sample of a Source configuration file, a section in solrconfig.xml. The presence of the <replica>
section causes CDCR to use this cluster as the Source and should not be present in the Target collections in the
cluster-to-cluster case. Details about each setting are after the two examples:


<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
<lst name="replica">
<str name="zkHost">10.240.18.211:2181</str>
<str name="Source">collection1</str>
<str name="Target">collection1</str>
</lst>
<lst name="replicator">
<str name="threadPoolSize">8</str>
<str name="schedule">1000</str>
<str name="batchSize">128</str>
</lst>
<lst name="updateLogSynchronizer">
<str name="schedule">1000</str>
</lst>
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog class="solr.CdcrUpdateLog">
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
</updateHandler>
</requestHandler>


Target Configuration

Target cluster configuration.

Here is a typical Target configuration. 
Target instance must configure an update processor chain that is specific to CDCR. The update processor chain
must include the CdcrUpdateProcessorFactory. The task of this processor is to ensure that the version
numbers attached to update requests coming from a CDCR Source SolrCloud are reused and not overwritten by
the Target. A properly configured Target configuration looks similar to this.


<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
<lst name="buffer">
<str name="defaultState">disabled</str>
</lst>
</requestHandler>
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">cdcr-processor-chain</str>
</lst>
</requestHandler>
<updateRequestProcessorChain name="cdcr-processor-chain">
<processor class="solr.CdcrUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog class="solr.CdcrUpdateLog">
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
</updateHandler>


Configuration Details

The configuration details, defaults and options are as follows:

The Replica Element

CDCR can be configured to forward update requests to one or more replicas. A replica is defined with a “replica” list as follows:
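As a minimal sketch based on the Source example above (zkHost points at the Target cluster's ZooKeeper, while Source and Target name the collections on each side):

<lst name="replica">
  <str name="zkHost">10.240.18.211:2181</str>
  <str name="Source">collection1</str>
  <str name="Target">collection1</str>
</lst>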

The Replicator Element
The replicator element (CDC Replicator settings).

The CDC Replicator is the component in charge of forwarding updates to the replicas. The replicator will monitor the update logs of the Source collection and will forward any new updates to the Target collection. The replicator uses a fixed thread pool to forward updates to multiple replicas in parallel. If more than one replica is configured,
one thread will forward a batch of updates from one replica at a time in a round-robin fashion. The replicator can be configured with a “replicator” list as follows:
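A minimal sketch, reusing the values from the Source example above; here threadPoolSize sets the number of replicator threads, schedule the polling interval in milliseconds, and batchSize the number of updates sent per batch:

<lst name="replicator">
  <str name="threadPoolSize">8</str>
  <str name="schedule">1000</str>
  <str name="batchSize">128</str>
</lst>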


The updateLogSynchronizer Element

Expert: Non-leader nodes need to synchronize their update logs with their leader node from time to time in order
to clean deprecated transaction log files. By default, such a synchronization process is performed every minute.
The schedule of the synchronization can be modified with an "updateLogSynchronizer" list as follows:
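A minimal sketch, matching the Source example above (schedule is the synchronization interval in milliseconds):

<lst name="updateLogSynchronizer">
  <str name="schedule">1000</str>
</lst>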


Log synchronization between non-leader nodes and the leader.

The Buffer Element
CDCR is configured by default to buffer any new incoming updates. When buffering updates, the updates log will
store all the updates indefinitely. Replicas do not need to buffer updates, and it is recommended to disable the
buffer on the Target SolrCloud. The buffer can be disabled at startup with a "buffer" list and the parameter "defaultState" as follows:
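A minimal sketch, matching the Target example earlier in this section:

<lst name="buffer">
  <str name="defaultState">disabled</str>
</lst>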



CDCR API

The CDCR API is used to control and monitor the replication process. Control actions are performed at a collection level, i.e., by using the following base URL for API calls: http://<hostname>:<port>/solr/<collection>.
Monitor actions are performed at a core level, i.e., by using the following base URL for API calls: http://<hostname>:<port>/solr/<core>.
Currently, none of the CDCR API calls have parameters.


API entry points and their functions.

API Entry Points (Control)

collection/cdcr?action=STATUS: Returns the current state of CDCR.
collection/cdcr?action=START: Starts CDCR replication.
collection/cdcr?action=STOP: Stops CDCR replication.
collection/cdcr?action=ENABLEBUFFER: Enables the buffering of updates.
collection/cdcr?action=DISABLEBUFFER: Disables the buffering of updates.
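For illustration, assuming a node at localhost:8983 and a collection named collection1 (both placeholders), the control calls look like this:

curl "http://localhost:8983/solr/collection1/cdcr?action=START"
curl "http://localhost:8983/solr/collection1/cdcr?action=STATUS"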



API Entry Points (Monitoring)

core/cdcr?action=QUEUES: Fetches statistics about the queue for each replica and about the update logs.
core/cdcr?action=OPS: Fetches statistics about the replication performance (operations per second) for each replica.
core/cdcr?action=ERRORS: Fetches statistics and other information about replication errors for each replica.
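Monitoring calls are issued against a core rather than a collection; for example, assuming a core named collection1_shard1_replica1 (a placeholder name):

curl "http://localhost:8983/solr/collection1_shard1_replica1/cdcr?action=QUEUES"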

Control Commands




Initial Startup


Upload the modified solrconfig.xml to ZooKeeper on both Source and Target.
Sync the index directories from the Source collection to the Target collection across the corresponding shard nodes.
Tip: rsync works well for this.
For example: if there are 2 shards on collection1 with 2 replicas for each shard, copy the corresponding index directories from the Source nodes to the matching Target nodes.

Start the ZooKeeper on the Target (DR) side.
Start the SolrCloud on the Target (DR) side.
Start the ZooKeeper on the Source side.
Start the SolrCloud on the Source side.
Tip: As a general rule, the Target (DR) side of the SolrCloud should be started before the Source side.
Activate CDCR on the Source instance using the CDCR API:
http://host:port/solr/collection_name/cdcr?action=START

There is no need to run the /cdcr?action=START command on the Target.
Disable the buffer on the Target:
http://host:port/solr/collection_name/cdcr?action=DISABLEBUFFER
Re-enable indexing.

Monitoring
Network and disk space monitoring are essential. Ensure that the system has plenty of available storage
to queue up changes if there is a disconnect between the Source and Target. A network outage between
the two data centers can cause your disk usage to grow. 
Tip: Set a monitor for your disks to send alerts when disk usage gets over a certain percentage (e.g., 70%).
Tip: Run a test. With moderate indexing, how long can the system queue changes before you run
out of disk space? 
Create a simple way to check the counts between the Source and the Target.
Keep in mind that if indexing is running, the Source and Target may not match document for
document. Set an alert to fire if the difference is greater than some percentage of the overall cloud
size.


Monitor disk space and the relative document counts of the Source and Target clusters, and set alerts on both.
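One hedged way to do such a count check, reusing the placeholder hosts and the cloud1 collection from the curl examples later in this section, is to compare the numFound values of a match-all query on each side:

#count on the Source
curl "http://<Source>:8983/solr/cloud1/select?q=*:*&rows=0&wt=json"
#count on the Target
curl "http://<Target>:8983/solr/cloud1/select?q=*:*&rows=0&wt=json"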

ZooKeeper settings

With CDCR, the Target ZooKeepers will have connections from the Target clouds and the Source clouds. 
You may need to increase the maxClientCnxns setting in the zoo.cfg. 

## raise the maximum number of connections allowed from a single client
## (maxClientCnxns=0 would mean no limit)
maxClientCnxns=800


Upgrading and Patching Production
When rolling in upgrades to your indexer or application, you should shut down the Source (production) and
the Target (DR). Depending on your setup, you may want to pause/stop indexing. Deploy the release or
patch and re-enable indexing. Then start the Target (DR).
Tip: There is no need to reissue the DISABLEBUFFERS or START commands. These are
persisted.
Tip: After starting the Target, run a simple test. Add a test document to each of the Source clouds. 
Then check for it on the Target. 


#send to the Source
curl http://<Source>/solr/cloud1/update -H 'Content-type:application/json' -d '[{"SKU":"ABC"}]'
#check the Target
curl "http://<Target>:8983/solr/cloud1/select?q=SKU:ABC&wt=json&indent=true"


Legacy Scaling and Distribution

What Problem Does Distribution Solve?
If searches are taking too long or the index is approaching the physical limitations of its machine, you should
consider distributing the index across two or more Solr servers.
To distribute an index, you divide the index into partitions called shards, each of which runs on a separate
machine. Solr then partitions searches into sub-searches, which run on the individual shards, reporting results
collectively. The architectural details underlying index sharding are invisible to end users, who simply experience
faster performance on queries against very large indexes.


Problems distribution solves: searches that are too slow and an index that is too large for one machine.
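As a sketch of how legacy distributed search is invoked (the hostnames and core names here are hypothetical), the client sends the query to one node and lists the sub-indexes to search with the shards parameter:

curl "http://solr1:8983/solr/core1/select?q=solr&shards=solr1:8983/solr/core1,solr2:8983/solr/core1"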

What Problem Does Replication Solve?
Replicating an index is useful when:
You have a large search volume which one machine cannot handle, so you need to distribute searches
across multiple read-only copies of the index.
There is a high volume/high rate of indexing which consumes machine resources and reduces search
performance on the indexing machine, so you need to separate indexing and searching.
You want to make a backup of the index (see Making and Restoring Backups of SolrCores).


Distributed Search with Index Sharding

Distributing Documents across Shards


Configuring the ReplicationHandler
In addition to ReplicationHandler configuration options specific to the master/slave roles, there are a few
special configuration options that are generally supported (even when using SolrCloud).
maxNumberOfBackups: an integer value dictating the maximum number of backups this node will keep
on disk as it receives backup commands.
Similar to most other request handlers in Solr, you may configure a set of "defaults, invariants, and/or
appends" parameters corresponding with any request parameters supported by the ReplicationHandler
when processing commands.


The example below shows a possible 'master' configuration for the ReplicationHandler, including a fixed
number of backups and an invariant setting for the maxWriteMBPerSec request parameter to prevent slaves
from saturating its network interface.


<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="master">
<str name="replicateAfter">optimize</str>
<str name="backupAfter">optimize</str>
<str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
<str name="commitReserveDuration">00:00:10</str>
</lst>
<int name="maxNumberOfBackups">2</int>
<lst name="invariants">
<str name="maxWriteMBPerSec">16</str>
</lst>
</requestHandler>



Replicating solrconfig.xml

In the configuration file on the master server, include a line like the following:

<str name="confFiles">solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml</str>

This ensures that the local configuration solrconfig_slave.xml will be saved as solrconfig.xml on the
slave. All other files will be saved with their original names.
On the master server, the file name of the slave configuration file can be anything, as long as the name is
correctly identified in the confFiles string; it will then be saved as whatever file name appears after the colon ':'.


Configuring the Replication RequestHandler on a Slave Server

The code below shows how to configure a ReplicationHandler on a slave.

<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="slave">
<!-- fully qualified url for the replication handler of master. It is
possible to pass on this as a request param for the fetchindex command -->
<str name="masterUrl">http://remote_host:port/solr/core_name/replication</str>
<!-- Interval in which the slave should poll master. Format is HH:mm:ss . 
If this is absent slave does not poll automatically.
But a fetchindex can be triggered from the admin or the http API -->
<str name="pollInterval">00:00:20</str>
<!-- THE FOLLOWING PARAMETERS ARE USUALLY NOT REQUIRED-->
<!-- To use compression while transferring the index files. The possible
values are internal|external. If the value is 'external' make sure
that your master Solr has the settings to honor the accept-encoding header.
See here for details: http://wiki.apache.org/solr/SolrHttpCompression
If it is 'internal' everything will be taken care of automatically.
USE THIS ONLY IF YOUR BANDWIDTH IS LOW.
THIS CAN ACTUALLY SLOWDOWN REPLICATION IN A LAN -->
<str name="compression">internal</str>
<!-- The following values are used when the slave connects to the master to
download the index files. Default values implicitly set as 5000ms and
10000ms respectively. The user DOES NOT need to specify these unless the
bandwidth is extremely low or if there is an extremely high latency -->
<str name="httpConnTimeout">5000</str>
<str name="httpReadTimeout">10000</str>
<!-- If HTTP Basic authentication is enabled on the master, then the slave
can be configured with the following -->
<str name="httpBasicAuthUser">username</str>
<str name="httpBasicAuthPassword">password</str>
</lst>
</requestHandler>



Setting Up a Repeater with the ReplicationHandler


A master may be able to serve only so many slaves without affecting performance. Some organizations have
deployed slave servers across multiple data centers. If each slave downloads the index from a remote data
center, the resulting download may consume too much network bandwidth. To avoid performance degradation in
cases like this, you can configure one or more slaves as repeaters. A repeater is simply a node that acts as both
a master and a slave.


To configure a server as a repeater, the definition of the Replication requestHandler in the solrconfig.xml
file must include the configuration lists used for both masters and slaves.
Be sure to set the replicateAfter parameter to commit, even if replicateAfter is set to optimize
on the main master. This is because on a repeater (or any slave), a commit is called only after the index is
downloaded. The optimize command is never called on slaves.
Optionally, one can configure the repeater to fetch compressed files from the master through the 
compression parameter to reduce the index download time.
Here is an example of a ReplicationHandler configuration for a repeater:

<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="master">
<str name="replicateAfter">commit</str>
<str name="confFiles">schema.xml,stopwords.txt,synonyms.txt</str>
</lst>
<lst name="slave">
<str
name="masterUrl">http://master.solr.company.com:8983/solr/core_name/replication</str
>
<str name="pollInterval">00:00:60</str>
</lst>
</requestHandler>




Commit and Optimize Operations
The replicateAfter parameter can accept multiple arguments. For example:
<str name="replicateAfter">startup</str>
<str name="replicateAfter">commit</str>
<str name="replicateAfter">optimize</str>



Slave Replication

Detailed walkthrough of the slave replication process.

The master is totally unaware of the slaves. The slave continuously keeps polling the master (depending on the
pollInterval parameter) to check the current index version of the master. If the slave finds out that the
master has a newer version of the index it initiates a replication process. The steps are as follows:
The slave issues a filelist command to get the list of the files. This command returns the names of
the files as well as some metadata (for example, size, a lastmodified timestamp, an alias if any).
The slave checks with its own index if it has any of those files in the local index. It then runs the filecontent
command to download the missing files. This uses a custom format (akin to the HTTP chunked encoding)
to download the full content or a part of each file. If the connection breaks in between, the download
resumes from the point it failed. At any point, the slave tries 5 times before giving up a replication
altogether.
The files are downloaded into a temp directory, so that if either the slave or the master crashes during the
download process, no files will be corrupted. Instead, the current replication will simply abort.
After the download completes, all the new files are moved to the live index directory and each file's
timestamp is set to match its counterpart on the master.

A commit command is issued on the slave by the Slave's ReplicationHandler and the new index is loaded.

Replicating Configuration Files

To replicate configuration files, list them using the confFiles parameter. Only files found in the conf directory
of the master's Solr instance will be replicated.
Solr replicates configuration files only when the index itself is replicated. That means even if a configuration file is
changed on the master, that file will be replicated only after there is a new commit/optimize on master's index.
Unlike the index files, where the timestamp is good enough to figure out if they are identical, configuration files
are compared against their checksum. The schema.xml files (on master and slave) are judged to be identical if
their checksums are identical.
As a precaution when replicating configuration files, Solr copies configuration files to a temporary directory before
moving them into their ultimate location in the conf directory. The old configuration files are then renamed and
kept in the same conf/ directory. The ReplicationHandler does not automatically clean up these old files.
If a replication involved downloading of at least one configuration file, the ReplicationHandler issues a
core-reload command instead of a commit command.


Configuration files are replicated to slaves automatically along with the index.


Resolving Corruption Issues on Slave Server


If documents are added to the slave, then the slave is no longer in sync with its master. However, the slave will
not undertake any action to put itself in sync, until the master has new index data. When a commit operation
takes place on the master, the index version of the master becomes different from that of the slave. The slave
then fetches the list of files and finds that some of the files present on the master are also present in the local
index but with different sizes and timestamps. This means that the master and slave have incompatible indexes.
To correct this problem, the slave then copies all the index files from master to a new index directory and asks
the core to load the fresh index from the new directory.


How a slave recovers when its index conflicts with the master's.


HTTP API Commands for the ReplicationHandler

You can use the HTTP commands below to control the ReplicationHandler's operations.
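A few commonly used commands, sketched here with placeholder host and core names:

#report the version of the index the master is serving
curl "http://master_host:8983/solr/core_name/replication?command=indexversion"
#force a slave to fetch the latest index from its master right away
curl "http://slave_host:8983/solr/core_name/replication?command=fetchindex"
#temporarily stop, and later resume, polling on a slave
curl "http://slave_host:8983/solr/core_name/replication?command=disablepoll"
curl "http://slave_host:8983/solr/core_name/replication?command=enablepoll"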


Distribution and Optimization
Optimizing an index is not something most users should generally worry about - but in particular users should be
aware of the impacts of optimizing an index when using the ReplicationHandler.
The time required to optimize a master index can vary dramatically. A small index may be optimized in minutes.
A very large index may take hours. The variables include the size of the index and the speed of the hardware.
Distributing a newly optimized index may take only a few minutes or up to an hour or more, again depending on
the size of the index and the performance capabilities of network connections and disks. During optimization the
machine is under load and does not process queries very well. Given a schedule of updates being driven a few
times an hour to the slaves, we cannot run an optimize with every committed snapshot.
Copying an optimized index means that the entire index will need to be transferred during the next snappull. This
is a large expense, but not nearly as huge as running the optimize everywhere. Consider this example: on a
three-slave one-master configuration, distributing a newly-optimized index takes approximately 80 seconds total.
Rolling the change across a tier would require approximately ten minutes per machine (or machine group). If this
optimize were rolled across the query tier, and if each slave node being optimized were disabled and not
receiving queries, a rollout would take at least twenty minutes and potentially as long as an hour and a half.
Additionally, the files would need to be synchronized so that, following the optimize, snappull would not think
that the independently optimized files were different in any way. This would also leave the door open to
independent corruption of indexes instead of each being a perfect copy of the master.
Optimizing on the master allows for a straight-forward optimization operation. No query slaves need to be taken
out of service. The optimized index can be distributed in the background as queries are being normally serviced.
The optimization can occur at any time convenient to the application providing index updates.
While optimizing may have some benefits in some situations, a rapidly changing index will not retain those
benefits for long, and since optimization is an intensive process, it may be better to consider other options, such
as lowering the merge factor (discussed in the section on Index Configuration).


Combining Distribution and Replication

----> Just use SolrCloud for this instead.


Merging Indexes


If you need to combine indexes from two different projects or from multiple servers previously used in a
distributed configuration, you can use either the IndexMergeTool included in lucene-misc or the
CoreAdminHandler.
To merge indexes, they must meet these requirements:
The two indexes must be compatible: their schemas should include the same fields and they should
analyze fields the same way.
The indexes must not include duplicate data.
Optimally, the two indexes should be built using the same schema.


Using IndexMergeTool

Merging:
To merge the indexes, do the following:
Make sure that both indexes you want to merge are closed.
Issue this command:

java -cp $SOLR/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-VERSION.jar:$SOLR/server/solr-webapp/webapp/WEB-INF/lib/lucene-misc-VERSION.jar \
  org.apache.lucene.misc.IndexMergeTool \
  /path/to/newindex \
  /path/to/old/index1 \
  /path/to/old/index2


This will create a new index at /path/to/newindex that contains both index1 and index2.
Copy this new directory to the location of your application's Solr index (move the old one aside first, of
course) and start Solr.


Using CoreAdmin

The MERGEINDEXES command of the CoreAdminHandler can be used to merge indexes into a new core – either
from one or more arbitrary indexDir directories or by merging from one or more existing srcCore core names.
See the CoreAdminHandler section for details.
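As a hedged sketch, merging two existing index directories into a core named new_core (the host, core name, and paths are placeholders) could look like this:

curl "http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=new_core&indexDir=/path/to/old/index1&indexDir=/path/to/old/index2"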


Client APIs


This section discusses the available client APIs for Solr. It covers the following topics:
Introduction to Client APIs: A conceptual overview of Solr client APIs.
Choosing an Output Format: Information about choosing a response format in Solr.
Using JavaScript: Explains why a client API is not needed for JavaScript responses.
Using Python: Information about Python and JSON responses.
Client API Lineup: A list of all Solr Client APIs, with links.
Using SolrJ: Detailed information about SolrJ, an API for working with Java applications.
Using Solr From Ruby: Detailed information about using Solr with Ruby applications.
MBean Request Handler: Describes the MBean request handler for programmatic access to Solr server statistics
and information.



Introduction to Client APIs

At its heart, Solr is a Web application, but because it is built on open protocols, any type of client application can
use Solr.
HTTP is the fundamental protocol used between client applications and Solr. The client makes a request and
Solr does some work and provides a response. Clients use requests to ask Solr to do things like perform queries
or index documents.
Client applications can reach Solr by creating HTTP requests and parsing the HTTP responses. Client APIs
encapsulate much of the work of sending requests and parsing responses, which makes it much easier to write
client applications.
Clients use Solr's five fundamental operations to work with Solr. The operations are query, index, delete, commit,
and optimize.
Queries are executed by creating a URL that contains all the query parameters. Solr examines the request URL,
performs the query, and returns the results. The other operations are similar, although in certain cases the HTTP
request is a POST operation and contains information beyond whatever is included in the request URL. An index
operation, for example, may contain a document in the body of the request.
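For instance, a simple query can be issued with nothing more than an HTTP GET; the collection name techproducts here is just an example:

curl "http://localhost:8983/solr/techproducts/select?q=video&wt=json"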
Solr also features an EmbeddedSolrServer that offers a Java API without requiring an HTTP connection. For
details, see Using SolrJ.



Choosing an Output Format

Many programming environments are able to send HTTP requests and retrieve responses. Parsing the
responses is a slightly more thorny problem. Fortunately, Solr makes it easy to choose an output format that will
be easy to handle on the client side.
Specify a response format using the wt parameter in a query. The available response formats are documented in
Response Writers.
Most client APIs hide this detail for you, so for many types of client applications, you won't ever have to specify a
wt parameter. In JavaScript, however, the interface to Solr is a little closer to the metal, so you will need to add
this parameter yourself.


Client API Lineup


Using JavaScript
Using Solr from JavaScript clients is so straightforward that it deserves a special mention. In fact, it is so
straightforward that there is no client API. You don't need to install any packages or configure anything.
HTTP requests can be sent to Solr using the standard XMLHttpRequest mechanism.
Out of the box, Solr can send JavaScript Object Notation (JSON) responses, which are easily interpreted in
JavaScript. Just add wt=json to the request URL to have responses sent as JSON.
For more information and an excellent example, read the SolJSON page on the Solr Wiki:
http://wiki.apache.org/solr/SolJSON



Using SolrJ






By this point I have gone through the official documentation fairly carefully, though it only leaves a general impression. There is a lot of material, and it is worth reading if you are working with Solr.

