Solr Performance Factors


Schema Design Considerations

indexed fields

The number of indexed fields greatly increases the following:

  • Memory usage during indexing
  • Segment merge time
  • Optimization times
  • Index size

These effects can be reduced by the use of omitNorms="true".
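For example, a field declaration in schema.xml might look like the following (a sketch; the field name and type here are illustrative, not from the original):

    <!-- schema.xml: omitNorms drops the per-document norm for this field,
         reducing memory use during indexing and shrinking the index -->
    <field name="title" type="text" indexed="true" stored="true" omitNorms="true"/>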

Stored fields

Retrieving the stored fields of a query result can be a significant expense. This cost is affected largely by the number of bytes stored per document: the higher the byte count, the more sparsely the documents will be distributed on disk, and the more I/O is necessary to retrieve the fields (usually this is a concern when storing large fields, like the entire contents of a document).

Consider storing large fields outside of Solr. If you feel inclined to do so regardless, consider using compressed fields, which increases the CPU cost of storing and retrieving the fields but lowers the I/O burden.

If you aren't always using all the stored fields, then enabling lazy field loading can be a huge boon, especially if compressed fields are used.
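Lazy field loading is enabled with a single flag in the query section of solrconfig.xml, for instance:

    <!-- solrconfig.xml: load stored fields only when they are actually requested -->
    <enableLazyFieldLoading>true</enableLazyFieldLoading>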

Configuration Considerations

mergeFactor

The mergeFactor roughly determines the number of segments.

The mergeFactor value tells Lucene how many segments of equal size to build before merging them into a single segment. It can be thought of as the base of a number system.

For example, if you set mergeFactor to 10, a new segment will be created on the disk for every 1000 (or maxBufferedDocs) documents added to the index. When the 10th segment of size 1000 is added, all 10 will be merged into a single segment of size 10,000. When 10 such segments of size 10,000 have been added, they will be merged into a single segment containing 100,000 documents, and so on. Therefore, at any time, there will be no more than 9 segments in each index size.

These values are set in the mainIndex section of solrconfig.xml (disregard the indexDefaults section):
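A minimal sketch matching the example above (other mainIndex settings omitted):

    <mainIndex>
      <!-- merge once 10 equal-size segments accumulate -->
      <mergeFactor>10</mergeFactor>
      <!-- flush a new segment to disk every 1000 buffered documents -->
      <maxBufferedDocs>1000</maxBufferedDocs>
    </mainIndex>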

mergeFactor Tradeoffs

High value merge factor (e.g., 25):

  • Pro: Generally improves indexing speed
  • Con: Less frequent merges, resulting in a collection with more index files which may slow searching

Low value merge factor (e.g., 2):

  • Pro: Smaller number of index files, which speeds up searching.
  • Con: More segment merges slow down indexing.

HashDocSet Max Size Considerations

The hashDocSet is an optimization specified in the solrconfig.xml that enables an int hash representation for filters (docSets) when the number of items in the set is less than maxSize. For smaller sets, this representation is more memory efficient, more efficient to iterate, and faster to take intersections.

The hashDocSet max size should be based primarily on the number of documents in the collection -- the larger the number of documents, the larger the hashDocSet max size. You will have to do a bit of trial-and-error to arrive at the optimal number:

  1. Calculate 0.005 of the total number of documents that you are going to store.
  2. Try values on either 'side' of that value to arrive at the best query times.
  3. When query times seem to plateau, and performance doesn't show much difference between the higher number and the lower, use the higher.
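In versions that still support it, hashDocSet is a single element in solrconfig.xml; the values below are the defaults from the old example configuration:

    <!-- solrconfig.xml: use an int hash representation for filter docSets
         with fewer than maxSize entries -->
    <hashDocSet maxSize="3000" loadFactor="0.75"/>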

Note: hashDocSet is no longer part of Solr as of version 1.4.0; see SOLR-1169.

Cache autoWarm Count Considerations

When a new searcher is opened, its caches may be prepopulated or "autowarmed" with cached objects from the caches in the old searcher. autowarmCount is the number of cached items that will be copied into the new searcher. You will probably want to base the autowarmCount setting on how long it takes to autowarm. You must consider the trade-off: time-to-autowarm versus how warm (i.e., autowarmCount) you want the cache to be. The autowarm parameter is set for the caches in solrconfig.xml.
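For example, autowarmCount is an attribute on each cache declaration in solrconfig.xml (a sketch; the sizes here are illustrative):

    <!-- solrconfig.xml: copy the 128 most recently used filter entries
         into each newly opened searcher's cache -->
    <filterCache class="solr.LRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="128"/>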

See also the Solr Caching page.

Cache hit rate

Monitor the cache statistics from Solr's admin! Raising Solr's cache size is often the best way to improve performance, especially if you notice many evictions for a particular cache type. Pay particular attention to the filterCache, which is also used internally by Solr for faceting. See also SolrCaching and this FAQ entry.

Explicit Warming of Sort Fields

If you do a lot of field-based sorting, it is advantageous to explicitly add warming queries that sort on those fields to the "newSearcher" and "firstSearcher" event listeners in your solrconfig, so that the FieldCache is populated prior to any queries being executed by your users.
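A sketch of such a listener (the sort field "price" is illustrative); a matching "firstSearcher" listener covers the first searcher opened after startup:

    <!-- solrconfig.xml: run a sorted query whenever a new searcher opens,
         so the FieldCache for the sort field is built before user queries arrive -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">price asc</str>
        </lst>
      </arr>
    </listener>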

Optimization Considerations

You may want to optimize an index in certain situations -- e.g., if you build your index once and then never modify it.

If you have a rapidly changing index, rather than optimizing, you likely simply want to use a lower merge factor. Optimizing is very expensive, and if the index is constantly changing, the slight performance boost will not last long. The tradeoff is often not worth it for a non-static index.

In a master/slave setup, sometimes you may also want to optimize on the master so that slaves serve from a single-segment index. This can greatly increase the time it takes to replicate the index, though, so it is often not desirable either.
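An optimize is typically triggered by posting an optimize message to the update handler, e.g.:

    <!-- POSTed to /update: merges the index down to a single segment -->
    <optimize/>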

Updates and Commit Frequency Tradeoffs

If slaves receive new collections too frequently their performance will suffer. In order to avoid this type of degradation you must understand how a slave receives a collection update so that you can know how to best adjust the relevant parameters (number/frequency of commits, snappullers, and autowarming/autocount) so that new collections do not get installed on slaves too frequently.

  1. A snapshot of the collection is taken every time a client runs a commit, or an optimization is run, depending on whether postCommit or postOptimize hooks are used on the master (see the listener sketch after this list).

  2. Snappullers on the slaves running on a cron'd basis check the master for new snapshots. If the snappullers find a new collection version the slaves pull it down and snapinstall it.
  3. Every time a new index searcher is opened, some autowarming of the cache occurs before Solr hands queries over to that version of the collection. It is crucial to individual query latency that queries have warmed caches.
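A sketch of such a postCommit hook on the master, following the style of the Solr CollectionDistribution documentation (the executable path is illustrative):

    <!-- solrconfig.xml on the master: take a snapshot after every commit -->
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>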

The three relevant parameters:

  • The number/frequency of snapshots is completely up to the indexing client. Therefore, the number of versions of the collection is determined by the client's activity.

  • The snappullers are cron'd. They could run every second, once a day, or anything in between. When they run, they will retrieve only the most recent collection that they do not have.

  • Cache autowarming is configured for each cache in solrconfig.xml.

If you desire frequent new collections in order for your most recent changes to appear "live online", you must have both frequent commits/snapshots and frequent snappulls. The most frequently you can distribute index changes and maintain good performance is probably in the range of 1 to 5 minutes, depending on your reliance on caching for good query times, and the time it takes to autowarm those caches.

Cache autowarming may be crucial to performance. On one hand a new cache version must be populated with enough entries so that subsequent queries will be served from the cache after the system switches to the new version of the collection. On the other hand, autowarming (populating) a new collection could take a lot of time, especially since it uses only one thread and one CPU. If your settings fire off snapinstaller too frequently, then a Solr slave could be in the undesirable condition of handing-off queries to one (old) collection, and, while warming a new collection, a second “new” one could be snapped and begin warming!

If we attempted to solve such a situation, we would have to invalidate the first “new” collection in order to use the second one, then when a “third” new collection would be snapped and warmed, we would have to invalidate the “second” new collection, and so on ad infinitum. A completely warmed collection would never make it to full term before it was aborted. This can be prevented with a properly tuned configuration so new collections do not get installed too rapidly.

Query Response Compression

Compressing the Solr XML response before it is sent back to the client is worthwhile in some circumstances. If responses are very large, NIC I/O limits are being encroached upon, and Gigabit Ethernet is not an option, using compression is a way out.

Compression increases CPU use, and since Solr is typically a CPU-bound service, compression diminishes query performance. Compression attempts to reduce files to 1/6th their original size, and network packets to 1/3rd their original size. (We're not taking the time right now to figure out if the big gap between bits and packets makes sense or not, but suffice it to say it's a nice reduction.) Query performance is impacted ~15% on the Solr server.

Consult the documentation for the application server you are using (e.g., Tomcat, Resin, Jetty, etc.) for more information on how to configure page compression.
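For example, with Tomcat, compression can be enabled on the HTTP connector in server.xml (a sketch; verify the attribute names against your Tomcat version):

    <!-- server.xml: gzip responses over 2KB for XML and plain-text types -->
    <Connector port="8080" protocol="HTTP/1.1"
               compression="on"
               compressionMinSize="2048"
               compressableMimeType="text/xml,text/plain"/>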

Indexing Performance

In general, adding many documents per update request is faster than one per update request.
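For example, a single XML update message can carry many documents (the field names here are illustrative):

    <!-- POSTed to /update: two documents in one request -->
    <add>
      <doc>
        <field name="id">doc1</field>
        <field name="title">First document</field>
      </doc>
      <doc>
        <field name="id">doc2</field>
        <field name="title">Second document</field>
      </doc>
    </add>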

For bulk updating from a Java client (in 3.X), consider using the StreamingUpdateSolrServer.java, which streams updates over multiple connections using multiple threads. N.B. In 4.X, StreamingUpdateSolrServer has been deprecated in favour of ConcurrentUpdateSolrServer.

Reducing the frequency of automatic commits or disabling them entirely may speed indexing. Beware that this can lead to increased memory usage, which can cause performance issues of its own, such as excessive swapping or garbage collection.

RAM Usage Considerations

OutOfMemoryErrors

If your Solr instance doesn't have enough memory allocated to it, the Java virtual machine will sometimes throw a Java OutOfMemoryError. There is no danger of data corruption when this occurs, and Solr will attempt to recover gracefully. Any adds/deletes/commits in progress when the error was thrown are not likely to succeed, however. Other adverse effects may arise. For instance, if the SimpleFSLock locking mechanism is in use (as is the case in Solr 1.2), an ill-timed OutOfMemoryError can potentially cause Solr to lose its lock on the index. If this happens, further attempts to modify the index will result in

SEVERE: Exception during commit/optimize:java.io.IOException: Lock obtain timed out: SimpleFSLock@/tmp/lucene-5d12dd782520964674beb001c4877b36-write.lock

errors.

If you want to see the heap when an OOM occurs, set "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump" (courtesy of Bill Au).

Memory allocated to the Java VM

The easiest way to fight this error, assuming the Java virtual machine isn't already using all your physical memory, is to increase the amount of memory allocated to the Java virtual machine running Solr. To do this for the example/ in the Solr distribution, if you're running the standard Sun virtual machine, you can use the -Xms and -Xmx command-line parameters:

java -Xms512M -Xmx1024M -jar start.jar

Factors affecting memory usage

You may also wish to try to actually reduce Solr's memory usage.

One factor is the size of the input document:

When processing an "add" command for a document, the standard XML update handler has two limitations:

  • All of the document's fields must simultaneously fit into memory. (Technically, it's actually the sum of min(<the actual field value's length>, maxFieldLength). As such, adjusting maxFieldLength may be of some help; see the snippet after this list.)

    • (I'm assuming that fields are truncated to maxFieldLength before being added to the relevant document object. If that's not true, then maxFieldLength won't help here. --ChrisHarris)

  • Each individual <field>...</field> tag in the input XML must fit into memory, regardless of maxFieldLength.
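maxFieldLength is set in the mainIndex (or indexDefaults) section of solrconfig.xml; a sketch with the old default value:

    <!-- solrconfig.xml: index at most 10,000 tokens per field -->
    <maxFieldLength>10000</maxFieldLength>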

Note that several different "add" commands can be running simultaneously (in different threads). The more threads, the greater the memory usage.

When indexing, memory usage will grow with the number of documents indexed until a commit is performed. A commit (including a soft commit) will free up almost all heap memory. To avoid very large heaps and associated garbage collection pauses during indexing, perform a manual (soft) commit periodically, or consider enabling autoCommit (or autoSoftCommit) in solrconfig.xml.
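A sketch of the update handler settings involved (the times are illustrative, in milliseconds; openSearcher and autoSoftCommit apply to Solr 4.x):

    <!-- solrconfig.xml: hard-commit at least once a minute to bound heap growth,
         without opening a new searcher; soft-commit every second for visibility -->
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>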