I'm studying Elasticsearch and Apache Lucene.
Lately I found that Apache Lucene has several merge policies, but Elasticsearch uses `TieredMergePolicy` instead of the others, such as `LogMergePolicy` and `LogByteSizeMergePolicy`.
So I've been searching for information about `TieredMergePolicy`. I found the algorithm, but I couldn't see why `TieredMergePolicy` is better than the others (I mean in the general case, not special cases).
Why is it important to pick similar-sized segments when merging, and how does it affect overall performance?
Please help me.
In a Chinese post on the topic I found the following statement, which provides some insight into the benefit of `TieredMergePolicy` over `LogByteSizeMergePolicy`, which was Lucene's default merge policy prior to `TieredMergePolicy`:
The difference between `TieredMergePolicy` and `LogByteSizeMergePolicy` is that the former can merge non-adjacent segments, and it distinguishes the maximum number of segments allowed to merge at one time (`setMaxMergeAtOnce(int v)`) from the maximum number of segments allowed in a tier (`setSegmentsPerTier(double v)`).
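To see how those two knobs interact, here is a hedged sketch (not Lucene's actual code; the function name and defaults are illustrative) of how a per-tier segment budget can be derived: `segs_per_tier` caps how many segments each size tier may hold, while `max_merge_at_once` sets the size ratio between successive tiers:

```python
import math

def allowed_segment_count(segment_bytes, segs_per_tier=10.0,
                          max_merge_at_once=10, floor_mb=2.0):
    """Illustrative sketch of a tiered segment budget.

    Walk up the tiers from the smallest segment size: each tier may
    hold segs_per_tier segments, and each tier's segment size is
    max_merge_at_once times larger than the previous tier's.
    """
    floor_bytes = floor_mb * 1024 * 1024
    # The bottom tier's segment size, with small segments floored.
    level_size = max(min(segment_bytes), floor_bytes)
    remaining = sum(segment_bytes)
    allowed = 0.0
    while True:
        count_this_level = remaining / level_size
        if count_this_level < segs_per_tier:
            # Last (topmost) tier: whatever bytes are left fit here.
            allowed += math.ceil(count_this_level)
            break
        allowed += segs_per_tier
        remaining -= segs_per_tier * level_size
        level_size *= max_merge_at_once  # next tier is bigger
    return int(allowed)
```

With 20 equal 1 MB segments and the defaults above, the budget comes out to 10, so the index is over budget and a merge would be selected.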
For a more expansive explanation of `TieredMergePolicy`, one great source is the class's comments on GitHub. The following information comes from that source and is available under an Apache 2 license:
Merges segments of approximately equal size, subject to an allowed number of segments per tier. This is similar to `LogByteSizeMergePolicy`, except this merge policy is able to merge non-adjacent segments, and separates how many segments are merged at once (`setMaxMergeAtOnce`) from how many segments are allowed per tier (`setSegmentsPerTier`). This merge policy also does not over-merge (i.e. cascade merges).
For normal merging, this policy first computes a "budget" of how many segments are allowed to be in the index. If the index is over-budget, then the policy sorts segments by decreasing size (pro-rating by percent deletes), and then finds the least-cost merge. Merge cost is measured by a combination of the "skew" of the merge (size of largest segment divided by smallest segment), total merge size and percent deletes reclaimed, so that merges with lower skew, smaller size and those reclaiming more deletes, are favored.
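The "least-cost" selection described above can be sketched as a scoring function. This is a loose, hedged model of the idea (the exponents and weighting are illustrative, not Lucene's exact formula): lower scores are better, and low skew, small total size, and more reclaimed deletes all lower the score:

```python
def merge_score(candidate_sizes, pct_deleted=0.0):
    """Illustrative skew-based merge cost; lower is better."""
    total = sum(candidate_sizes)
    # Skew: largest segment relative to the merged result.
    # Perfectly equal-sized candidates give the minimum, 1/n.
    skew = max(candidate_sizes) / total
    # Mildly prefer smaller total merges.
    size_factor = total ** 0.05
    # Favor merges that reclaim more deleted documents.
    reclaim = (1.0 - pct_deleted) ** 2.0
    return skew * size_factor * reclaim

def pick_least_cost(candidates):
    """Pick the candidate merge with the lowest score."""
    return min(candidates, key=merge_score)
```

Under this model, merging three 10 MB segments scores better than merging a 28 MB segment with two 1 MB segments, which is exactly the "prefer similar-sized segments" behavior the policy is built around.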
If a merge will produce a segment that's larger than `setMaxMergedSegmentMB`, then the policy will merge fewer segments (down to 1 at once, if that one has deletions) to keep the segment size under budget.
NOTE: this policy freely merges non-adjacent segments; if this is a problem, use `LogMergePolicy`.
NOTE: This policy always merges by byte size of the segments, and always pro-rates by percent deletes.
NOTE: Starting with Lucene 7.5, there are several changes:

- `findForcedMerges` and `findForcedDeletesMerges` respect the max segment size by default.
- When `findForcedMerges` is called with a `maxSegmentCount` other than 1, the resulting index is not guaranteed to have <= `maxSegmentCount` segments. Rather, it is on a "best effort" basis. Specifically, the theoretical ideal segment size is calculated and a "fudge factor" of 25% is added as the new `maxSegmentSize`, which is respected.
- `findForcedDeletesMerges` will not produce segments greater than `maxSegmentSize`.
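The "ideal size plus 25% fudge factor" rule from the second bullet is simple enough to show directly (a sketch of the arithmetic only; the function name is mine, not Lucene's):

```python
def best_effort_max_segment_size(total_index_bytes, max_segment_count):
    """Best-effort forced-merge sizing per the Lucene 7.5+ class comments:
    the theoretical ideal segment size plus a 25% fudge factor."""
    ideal = total_index_bytes / max_segment_count
    return ideal * 1.25
```

So asking a 1 GB index to force-merge down to 4 segments caps each resulting segment at roughly 320 MB rather than exactly 256 MB, which is why the final segment count is not guaranteed to be <= `maxSegmentCount`.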