SolrCloud Wiki Translation (3): Shards & Indexing Data


Source: http://my.oschina.net/zengjie/blog/198865

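Summary: A translation of the new SolrCloud wiki, written down to consolidate my own understanding. Original: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud This article mainly explains what shards do and how they behave, and describes how document data is indexed across the cluster.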
SolrLucene SolrCloud ZooKeeper

    When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each shard is a portion of the logical index, or core, and consists of the set of all nodes containing that section of the index.

    A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have one shard per state, or one shard per category, where each set of data is likely to be searched independently but is often combined with the others.

    Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple shards, so the query ran against the entire Solr index and no documents were missed from the search results. Splitting the core across shards is therefore not an exclusively SolrCloud concept. There were, however, several problems with the distributed approach that necessitated improvement with SolrCloud:

    1. Splitting of the core into shards was somewhat manual.
    2. There was no support for distributed indexing, which meant that you needed to explicitly send documents to a specific shard; Solr couldn't figure out on its own which shards to send documents to.
    3. There was no load balancing or failover, so if you got a high number of queries, you needed to figure out where to send them, and if one shard died it was just gone.

    SolrCloud fixes all those problems. Both indexing and queries are distributed automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can have multiple replicas for additional robustness.

    Unlike Solr 3.x, SolrCloud has no masters or slaves. Instead, there are leaders and replicas. Leaders are automatically elected, initially on a first-come-first-served basis, and then based on the ZooKeeper process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.

    If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it is assigned to the shard with the fewest replicas. When there is a tie, it is assigned to the shard with the lowest shard ID.
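
    The assignment rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not Solr code: the function name assign_shard and the use of integer shard IDs are assumptions made for the example.

```python
def assign_shard(replica_counts):
    """Pick the shard that a newly started node should join.

    replica_counts maps a shard ID (an integer here, by assumption) to the
    number of nodes currently serving that shard.
    """
    if not replica_counts:
        raise ValueError("no shards defined")
    # Fewest replicas wins; on a tie, the lowest shard ID wins.
    return min(replica_counts, key=lambda shard: (replica_counts[shard], shard))


counts = {1: 2, 2: 1, 3: 1}
chosen = assign_shard(counts)  # shards 2 and 3 tie, so the lower ID is chosen
counts[chosen] += 1
print(chosen, counts)
```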

    When a document is sent to a machine for indexing, the system first determines whether the machine is a replica or a leader.

    • If the machine is a replica, the document is forwarded to the leader for processing.
    • If the machine is a leader, SolrCloud determines which shard the document should go to and forwards the document to the leader for that shard. That leader indexes the document for its shard and forwards the update to itself and any replicas.
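
    The two rules above can be walked through with a toy in-memory model. This is a sketch for intuition only: the Node and Cluster classes, the node names, and the pick_shard callback are all hypothetical, and the real system does this over the network with ZooKeeper tracking who the leaders are.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    shard: str
    is_leader: bool
    index: list = field(default_factory=list)  # documents stored on this node


class Cluster:
    def __init__(self, nodes, pick_shard):
        self.nodes = nodes
        self.pick_shard = pick_shard  # hypothetical function: doc -> shard name

    def leader_of(self, shard):
        return next(n for n in self.nodes if n.shard == shard and n.is_leader)

    def replicas_of(self, shard):
        return [n for n in self.nodes if n.shard == shard and not n.is_leader]

    def receive(self, node, doc):
        """Apply the two rules and return the names of the nodes visited."""
        path = [node.name]
        if not node.is_leader:
            # Rule 1: a replica forwards the document to its leader.
            node = self.leader_of(node.shard)
            path.append(node.name)
        # Rule 2: the leader determines which shard owns the document and
        # hands it to that shard's leader (possibly itself).
        target = self.leader_of(self.pick_shard(doc))
        if target is not node:
            path.append(target.name)
        # The owning leader indexes the document and updates its replicas.
        target.index.append(doc)
        for replica in self.replicas_of(target.shard):
            replica.index.append(doc)
        return path


nodes = [
    Node("n1", "shard1", True), Node("n2", "shard1", False),
    Node("n3", "shard2", True), Node("n4", "shard2", False),
]
# For the demo, the target shard is simply read from the document itself.
cluster = Cluster(nodes, pick_shard=lambda doc: doc["shard"])
path = cluster.receive(nodes[1], {"id": "a", "shard": "shard2"})
print(path)  # ['n2', 'n1', 'n3']
```

    Here the document arrives at replica n2, is passed to its leader n1, and is then handed to n3, the leader of the owning shard, which also updates n4.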


    Document Routing

    Solr 4.1 added the ability to co-locate documents to improve query performance. (Translator's note: "co-locate" here means grouping related documents together.)

    Solr 4.5 added the ability to specify the router implementation with the router.name parameter. If you use the "compositeId" router, you can send documents with a prefix in the document ID, and Solr uses that prefix when calculating the hash that determines which shard a document is sent to for indexing. The prefix can be anything you like (it doesn't have to be the shard name, for example), but it must be consistent so that Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM" and the document ID is "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is critical here: it is the separator that marks off the prefix, and the prefix ("IBM") is what determines the shard the document is directed to.
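
    To see why a shared prefix keeps documents together, here is a deliberately simplified Python sketch. It is not Solr's algorithm: the real compositeId router computes its hash differently, whereas this toy just takes an MD5 digest of the prefix modulo a made-up shard count. The names composite_id, toy_shard_for and NUM_SHARDS are assumptions for the example.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of shards in the collection


def composite_id(prefix, doc_id):
    """Build an ID of the form 'prefix!doc_id'."""
    return f"{prefix}!{doc_id}"


def toy_shard_for(full_id, num_shards=NUM_SHARDS):
    """Map an ID to a shard number using only the part before '!'.

    IDs without a '!' are hashed in full. Stand-in hash only (assumption):
    Solr does not use MD5 modulo the shard count.
    """
    route_key = full_id.split("!", 1)[0]
    digest = hashlib.md5(route_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


first = composite_id("IBM", "12345")
second = composite_id("IBM", "67890")
# Both IDs share the "IBM" prefix, so the toy router sends them to one shard.
print(first, second, toy_shard_for(first) == toy_shard_for(second))
```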

    Then at query time, you include the prefix(es) in your query with the _route_ parameter (e.g., q=solr&_route_=IBM!) to direct the query to specific shards. In some situations, this may improve query performance because it avoids the network latency of querying all the shards.
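
    A routed query like the one above can be assembled with Python's standard library. The base URL, the collection name "collection1", and the helper name routed_query_url are assumptions for illustration; only the q and _route_ parameters come from the text. Note that urlencode percent-encodes the '!' as %21, which decodes back to '!' on the server side.

```python
from urllib.parse import urlencode


def routed_query_url(base_url, collection, query, prefix):
    """Build a select URL that restricts the query to one route prefix."""
    params = urlencode({"q": query, "_route_": f"{prefix}!"})
    return f"{base_url}/{collection}/select?{params}"


# Hypothetical host and collection name.
url = routed_query_url("http://localhost:8983/solr", "collection1", "solr", "IBM")
print(url)
```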

    The _route_ parameter replaces shard.keys, which has been deprecated and will be removed in a future Solr release.

    If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID.

    If you defined the "implicit" router when you created the collection, you can additionally define a router.field parameter, which names a field in each document that identifies the shard the document belongs to. If that field is missing from a document, however, the document is rejected. You can also use the _route_ parameter to name a specific shard.
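
    The router.field behaviour described above can be mimicked in a few lines of Python. This is a toy model, not Solr code: the function route_implicit, the field name "region", and the shard names are hypothetical, and rejecting an unknown shard name is an extra assumption of this sketch rather than something the text states.

```python
def route_implicit(doc, router_field, shards):
    """Return the shard named by doc[router_field], or reject the document."""
    shard = doc.get(router_field)
    if not shard:
        # Per the text: a document without the routing field is rejected.
        raise ValueError(f"document rejected: missing field {router_field!r}")
    if shard not in shards:
        # Assumption of this sketch: unknown shard names are also rejected.
        raise ValueError(f"document rejected: unknown shard {shard!r}")
    return shard


shards = {"east", "west"}  # hypothetical shard names
print(route_implicit({"id": "1", "region": "east"}, "region", shards))

try:
    route_implicit({"id": "2"}, "region", shards)
except ValueError as err:
    print(err)
```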

    Shard Splitting

    Until Solr 4.3, you had to decide on the number of shards when you created a collection in SolrCloud, and you could not change it later. It can be difficult to know in advance how many shards you will need, particularly when organizational requirements can change at a moment's notice, and the cost of finding out later that you chose wrong can be high: it involves creating new cores and re-indexing all of your data.

    The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. (Translator's note: more precisely, the data in the original shard is copied and divided between the two new shards.) You can delete the old shard later, when you're ready.
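
    As a rough illustration, the requests for such a split and for the later clean-up can be built as Collections API URLs. The host, the collection name "collection1", the shard name "shard1", and the helper collections_api_url are assumptions for the example; SPLITSHARD and DELETESHARD are the Collections API actions for splitting and deleting a shard, but check the Collections API section for the parameters supported by your Solr version. The sketch only builds the URLs and does not send them.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


base = "http://localhost:8983/solr"  # hypothetical Solr address

# Step 1: split the shard; the original shard is left in place.
split_url = collections_api_url(
    base, "SPLITSHARD", collection="collection1", shard="shard1"
)
# Step 2 (later, once the new shards are in use): delete the old shard.
delete_url = collections_api_url(
    base, "DELETESHARD", collection="collection1", shard="shard1"
)
print(split_url)
print(delete_url)
```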

    Note: for the detailed procedure and inner workings of shard splitting, see the article on searchhub.

    More details on how to use shard splitting are in the section on the Collections API.

    End of article.

