Solr Performance Factors


Schema Design Considerations

indexed fields

The number of indexed fields greatly increases the following:

  • Memory usage during indexing
  • Segment merge time
  • Optimization times
  • Index size

These effects can be reduced by the use of omitNorms="true".
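For example, a field declaration in schema.xml might look like the following (a sketch; the field name and type here are illustrative, not from the original):

    <!-- schema.xml: omitNorms drops the per-document norm for this field,
         reducing memory use during indexing and shrinking the index -->
    <field name="title" type="text" indexed="true" stored="true" omitNorms="true"/>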

Stored fields

Retrieving the stored fields of a query result can be a significant expense. This cost is affected largely by the number of bytes stored per document: the higher the byte count, the more sparsely the documents will be distributed on disk, and the more I/O is necessary to retrieve the fields (usually this is a concern when storing large fields, like the entire contents of a document).

Consider storing large fields outside of Solr. If you feel inclined to do so regardless, consider using compressed fields, which increases the CPU cost of storing and retrieving the fields but lowers the I/O burden.

If you aren't always using all the stored fields, then enabling lazy field loading can be a huge boon, especially if compressed fields are used.
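Lazy field loading is enabled with a single flag in the query section of solrconfig.xml, for instance:

    <!-- solrconfig.xml: load stored fields only when they are actually requested -->
    <enableLazyFieldLoading>true</enableLazyFieldLoading>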

Configuration Considerations

mergeFactor

The mergeFactor roughly determines the number of segments.

The mergeFactor value tells Lucene how many segments of equal size to build before merging them into a single segment. It can be thought of as the base of a number system.

For example, if you set mergeFactor to 10, a new segment will be created on the disk for every 1000 (or maxBufferedDocs) documents added to the index. When the 10th segment of size 1000 is added, all 10 will be merged into a single segment of size 10,000. When 10 such segments of size 10,000 have been added, they will be merged into a single segment containing 100,000 documents, and so on. Therefore, at any time, there will be no more than 9 segments in each index size.

These values are set in the mainIndex section of solrconfig.xml (disregard the indexDefaults section):
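A minimal sketch matching the example above (other mainIndex settings omitted):

    <mainIndex>
      <!-- merge once 10 equal-size segments accumulate -->
      <mergeFactor>10</mergeFactor>
      <!-- flush a new segment to disk every 1000 buffered documents -->
      <maxBufferedDocs>1000</maxBufferedDocs>
    </mainIndex>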

mergeFactor Tradeoffs

High value merge factor (e.g., 25):

  • Pro: Generally improves indexing speed
  • Con: Less frequent merges, resulting in a collection with more index files which may slow searching

Low value merge factor (e.g., 2):

  • Pro: Smaller number of index files, which speeds up searching.
  • Con: More segment merges slow down indexing.

HashDocSet Max Size Considerations

The hashDocSet is an optimization specified in the solrconfig.xml that enables an int hash representation for filters (docSets) when the number of items in the set is less than maxSize. For smaller sets, this representation is more memory efficient, more efficient to iterate, and faster to take intersections.

The hashDocSet max size should be based primarily on the number of documents in the collection -- the larger the number of documents, the larger the hashDocSet max size. You will have to do a bit of trial-and-error to arrive at the optimal number:

  1. Calculate 0.005 of the total number of documents that you are going to store.
  2. Try values on either 'side' of that value to arrive at the best query times.
  3. When query times seem to plateau, and performance doesn't show much difference between the higher number and the lower, use the higher.
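In versions that still support it, hashDocSet is a single element in solrconfig.xml; the values below are the defaults from the old example configuration:

    <!-- solrconfig.xml: use an int hash representation for filter docSets
         with fewer than maxSize entries -->
    <hashDocSet maxSize="3000" loadFactor="0.75"/>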

Note: hashDocSet is no longer part of Solr as of version 1.4.0; see SOLR-1169.

Cache autoWarm Count Considerations

When a new searcher is opened, its caches may be prepopulated or "autowarmed" with cached objects from the caches in the old searcher. autowarmCount is the number of cached items that will be copied into the new searcher. You will probably want to base the autowarmCount setting on how long it takes to autowarm. You must consider the trade-off: time-to-autowarm versus how warm (i.e., autowarmCount) you want the cache to be. The autowarm parameter is set for the caches in solrconfig.xml.
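For example, autowarmCount is an attribute on each cache declaration in solrconfig.xml (a sketch; the sizes here are illustrative):

    <!-- solrconfig.xml: copy the 128 most recently used filter entries
         into each newly opened searcher's cache -->
    <filterCache class="solr.LRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="128"/>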

See also the Solr Caching page.

Cache hit rate

Monitor the cache statistics from Solr's admin! Raising Solr's cache size is often the best way to improve performance, especially if you notice many evictions for a particular cache type. Pay particular attention to the filterCache, which is also used internally by Solr for faceting. See also SolrCaching and this FAQ entry.

Explicit Warming of Sort Fields

If you do a lot of field-based sorting, it is advantageous to explicitly add warming queries that sort on those fields to the "newSearcher" and "firstSearcher" event listeners in your solrconfig, so that the FieldCache is populated prior to any queries being executed by your users.
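A sketch of such a listener (the sort field "price" is illustrative); a matching "firstSearcher" listener covers the first searcher opened after startup:

    <!-- solrconfig.xml: run a sorted query whenever a new searcher opens,
         so the FieldCache for the sort field is built before user queries arrive -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">price asc</str>
        </lst>
      </arr>
    </listener>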

Optimization Considerations

You may want to optimize an index in certain situations -- e.g., if you build your index once and then never modify it.

If you have a rapidly changing index, rather than optimizing, you likely simply want to use a lower merge factor. Optimizing is very expensive, and if the index is constantly changing, the slight performance boost will not last long. The tradeoff is often not worth it for a non-static index.

In a master/slave setup, sometimes you may also want to optimize on the master so that slaves serve from a single-segment index. This can greatly increase the time it takes to replicate the index, though, so it is often not desirable either.
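An optimize is typically triggered by posting an optimize message to the update handler, e.g.:

    <!-- POSTed to /update: merges the index down to a single segment -->
    <optimize/>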

Updates and Commit Frequency Tradeoffs

If slaves receive new collections too frequently their performance will suffer. In order to avoid this type of degradation you must understand how a slave receives a collection update so that you can know how to best adjust the relevant parameters (number/frequency of commits, snappullers, and autowarming/autocount) so that new collections do not get installed on slaves too frequently.

  1. A snapshot of the collection is taken every time a client runs a commit, or an optimization is run, depending on whether postCommit or postOptimize hooks are used on the master (see the listener sketch after this list).

  2. Snappullers on the slaves running on a cron'd basis check the master for new snapshots. If the snappullers find a new collection version the slaves pull it down and snapinstall it.
  3. Every time a new index searcher is opened, some autowarming of the cache occurs before Solr hands queries over to that version of the collection. It is crucial to individual query latency that queries have warmed caches.
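A sketch of such a postCommit hook on the master, following the style of the Solr CollectionDistribution documentation (the executable path is illustrative):

    <!-- solrconfig.xml on the master: take a snapshot after every commit -->
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>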

The three relevant parameters:

  • The number/frequency of snapshots is completely up to the indexing client. Therefore, the number of versions of the collection is determined by the client's activity.

  • The snappullers are cron'd. They could run every second, once a day, or anything in between. When they run, they will retrieve only the most recent collection that they do not have.

  • Cache autowarming is configured for each cache in solrconfig.xml.

If you desire frequent new collections in order for your most recent changes to appear "live online", you must have both frequent commits/snapshots and frequent snappulls. The most frequently you can distribute index changes and maintain good performance is probably in the range of 1 to 5 minutes, depending on your reliance on caching for good query times, and the time it takes to autowarm those caches.

Cache autowarming may be crucial to performance. On one hand a new cache version must be populated with enough entries so that subsequent queries will be served from the cache after the system switches to the new version of the collection. On the other hand, autowarming (populating) a new collection could take a lot of time, especially since it uses only one thread and one CPU. If your settings fire off snapinstaller too frequently, then a Solr slave could be in the undesirable condition of handing-off queries to one (old) collection, and, while warming a new collection, a second “new” one could be snapped and begin warming!

If we attempted to solve such a situation, we would have to invalidate the first “new” collection in order to use the second one, then when a “third” new collection would be snapped and warmed, we would have to invalidate the “second” new collection, and so on ad infinitum. A completely warmed collection would never make it to full term before it was aborted. This can be prevented with a properly tuned configuration so new collections do not get installed too rapidly.

Query Response Compression

Compressing the Solr XML response before it is sent back to the client is worthwhile in some circumstances. If responses are very large, NIC I/O limits are being encroached upon, and Gigabit Ethernet is not an option, using compression is a way out.

Compression increases CPU use, and since Solr is typically a CPU-bound service, compression diminishes query performance. Compression attempts to reduce files to 1/6th their original size, and network packets to 1/3rd their original size. (We're not taking the time right now to figure out if the big gap between bits and packets makes sense or not, but suffice it to say it's a nice reduction.) Query performance is impacted ~15% on the Solr server.

Consult the documentation for the application server you are using (e.g., Tomcat, Resin, Jetty, etc.) for more information on how to configure page compression.
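For example, with Tomcat, compression can be enabled on the HTTP connector in server.xml (a sketch; verify the attribute names against your Tomcat version):

    <!-- server.xml: gzip responses over 2KB for XML and plain-text types -->
    <Connector port="8080" protocol="HTTP/1.1"
               compression="on"
               compressionMinSize="2048"
               compressableMimeType="text/xml,text/plain"/>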

Indexing Performance

In general, adding many documents per update request is faster than one per update request.
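For example, a single XML update message can carry many documents (the field names here are illustrative):

    <!-- POSTed to /update: two documents in one request -->
    <add>
      <doc>
        <field name="id">doc1</field>
        <field name="title">First document</field>
      </doc>
      <doc>
        <field name="id">doc2</field>
        <field name="title">Second document</field>
      </doc>
    </add>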

For bulk updating from a Java client (in 3.X), consider using the StreamingUpdateSolrServer.java, which streams updates over multiple connections using multiple threads. N.B. In 4.X, StreamingUpdateSolrServer has been deprecated in favour of ConcurrentUpdateSolrServer.

Reducing the frequency of automatic commits or disabling them entirely may speed indexing. Beware that this can lead to increased memory usage, which can cause performance issues of its own, such as excessive swapping or garbage collection.

RAM Usage Considerations

OutOfMemoryErrors

If your Solr instance doesn't have enough memory allocated to it, the Java virtual machine will sometimes throw a Java OutOfMemoryError. There is no danger of data corruption when this occurs, and Solr will attempt to recover gracefully. Any adds/deletes/commits in progress when the error was thrown are not likely to succeed, however. Other adverse effects may arise. For instance, if the SimpleFSLock locking mechanism is in use (as is the case in Solr 1.2), an ill-timed OutOfMemoryError can potentially cause Solr to lose its lock on the index. If this happens, further attempts to modify the index will result in

SEVERE: Exception during commit/optimize:java.io.IOException: Lock obtain timed out: SimpleFSLock@/tmp/lucene-5d12dd782520964674beb001c4877b36-write.lock

errors.

If you want to see the heap when an OOM occurs, set "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump" (courtesy of Bill Au).

Memory allocated to the Java VM

The easiest way to fight this error, assuming the Java virtual machine isn't already using all your physical memory, is to increase the amount of memory allocated to the Java virtual machine running Solr. To do this for the example/ in the Solr distribution, if you're running the standard Sun virtual machine, you can use the -Xms and -Xmx command-line parameters:

java -Xms512M -Xmx1024M -jar start.jar

Factors affecting memory usage

You may also wish to try to actually reduce Solr's memory usage.

One factor is the size of the input document:

When processing an "add" command for a document, the standard XML update handler has two limitations:

  • All of the document's fields must simultaneously fit into memory. (Technically, it's actually the sum of min(<the actual field value's length>, maxFieldLength). As such, adjusting maxFieldLength may be of some help; see the snippet after this list.)

    • (I'm assuming that fields are truncated to maxFieldLength before being added to the relevant document object. If that's not true, then maxFieldLength won't help here. --ChrisHarris)

  • Each individual <field>...</field> tag in the input XML must fit into memory, regardless of maxFieldLength.
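maxFieldLength is set in the mainIndex (or indexDefaults) section of solrconfig.xml; a sketch with the old default value:

    <!-- solrconfig.xml: index at most 10,000 tokens per field -->
    <maxFieldLength>10000</maxFieldLength>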

Note that several different "add" commands can be running simultaneously (in different threads). The more threads, the greater the memory usage.

When indexing, memory usage will grow with the number of documents indexed until a commit is performed. A commit (including a soft commit) will free up almost all heap memory. To avoid very large heaps and associated garbage collection pauses during indexing, perform a manual (soft) commit periodically, or consider enabling autoCommit (or autoSoftCommit) in solrconfig.xml.
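A sketch of the update handler settings involved (the times are illustrative, in milliseconds; openSearcher and autoSoftCommit apply to Solr 4.x):

    <!-- solrconfig.xml: hard-commit at least once a minute to bound heap growth,
         without opening a new searcher; soft-commit every second for visibility -->
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>1000</maxTime>
    </autoSoftCommit>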