I was reading the Elasticsearch documentation on "doc_values" (http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/doc-values.html), which says:
"Doc values are built at index time, not at search time", so what will be built if using doc_values?
"doc values are prebuilt and much faster to initialize", why is it much faster?
"but without the heap memory usage", so using the page cache?
Can somebody explain to me how doc_values is implemented and when I should use it? I check my heap usage with jstat periodically, and I can see that I still have plenty of space to use.
"Doc values are built at index time, not at search time", so what will be built if using doc_values?
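There are two types of workloads for which we need a columnar view on top of the data: sorting and aggregations. In the current version of Elasticsearch there are two ways to get that view: field data, which is built at search time by uninverting the inverted index and is kept in memory, and doc values, which are computed at index time and written to disk. To illustrate what uninverting means, the following inverted index (term -> document IDs)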
foo -> 0, 1
bar -> 1
will be transformed into the following column-oriented data structure (document ID -> terms)
0 -> foo
1 -> foo, bar
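To make the transformation concrete, here is a minimal Python sketch of uninverting the example above. It only illustrates the idea; the function name `uninvert` and the plain-dictionary representation are assumptions for this example, not how Elasticsearch or Lucene actually store either structure.

```python
from collections import defaultdict


def uninvert(inverted_index):
    """Turn a term -> doc IDs mapping into a doc ID -> terms mapping.

    Hypothetical helper for illustration only; real field data and doc
    values use compact encodings rather than Python dictionaries.
    """
    columnar = defaultdict(list)
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            columnar[doc_id].append(term)
    return dict(columnar)


# The inverted index from the example: term -> document IDs.
inverted_index = {"foo": [0, 1], "bar": [1]}

columnar = uninvert(inverted_index)
print(columnar)  # {0: ['foo'], 1: ['foo', 'bar']}
```

Every posting of every term has to be visited once, which is why doing this at search time gets expensive on a large index.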
"doc values are prebuilt and much faster to initialize", why it is much faster?
This uninversion process that I mentioned is very CPU- and I/O-intensive. The result is put in a cache, but the first access is still slow, and this hurts the latency of all queries that run right after a large merge. You could work around this by loading fielddata eagerly, but even though that makes response times better, it only moves the problem elsewhere: changes to your index will take longer to become visible, since Elasticsearch will wait for field data to be loaded before the new point-in-time view of the index is available for search.
With doc values, on the other hand, you only need to read some tiny metadata from disk and that's it.
"but without the heap memory usage", so using the page cache?
Exactly! Doc values require very little heap memory, mostly metadata about each field and how things are encoded on disk. The rest is read directly from disk and relies on the filesystem cache for performance.
Can somebody explain to me how doc_values is implemented and when I should use it? I check my heap usage with jstat periodically, and I can see that I still have plenty of space to use.
This is a bit complicated because there are different cases... for instance:
But in practice, the important thing to know is that it's basically a very large mmap'ed file that is read sequentially, so even though disk-based, it's still friendly to your I/O system.
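As a rough illustration of the "mmap'ed file read sequentially" idea, here is a small Python sketch. It assumes a made-up on-disk layout of one fixed-width 64-bit integer per document; the real doc values encodings are different and more compact, and the file name and helper functions are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout for this sketch: one little-endian 64-bit integer per document.
VALUE = struct.Struct("<q")


def write_column(path, values):
    """Write one value per document, in document ID order."""
    with open(path, "wb") as f:
        for v in values:
            f.write(VALUE.pack(v))


def read_value(mm, doc_id):
    """Look up the value of a single document by its offset in the mapping."""
    return VALUE.unpack_from(mm, doc_id * VALUE.size)[0]


def scan_column(path):
    """Memory-map the file and read every document's value in order."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            doc_count = len(mm) // VALUE.size
            return [read_value(mm, doc_id) for doc_id in range(doc_count)]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "price.col")  # hypothetical file name
    write_column(path, [10, 25, 7])
    values = scan_column(path)
    print(values)  # [10, 25, 7]
```

The values never have to be copied into a heap-resident structure up front: the operating system pages the file in as it is read and keeps hot pages in the filesystem cache.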
If this is something you are interested in, you can read more about it
Regarding when you should use doc values, I think you should enable them on all fields that you plan to sort or aggregate on. There is an ongoing discussion about enabling doc values by default in the next Elasticsearch major version.
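For reference, here is a sketch of what enabling doc values on a field could look like, written as a Python dictionary that serializes to the JSON mapping fragment. The field name `price` and the helper `doc_values_mapping` are made up for the example; `doc_values` is the mapping parameter, but how this fragment is wrapped and sent to the cluster depends on your Elasticsearch version, so check the mapping documentation for the version you run.

```python
import json


def doc_values_mapping(field_name, field_type):
    """Build a mapping fragment that turns doc values on for one field.

    Hypothetical helper: it only constructs the JSON body and does not
    talk to a cluster.
    """
    return {
        "properties": {
            field_name: {
                "type": field_type,
                "doc_values": True,  # store the columnar view on disk at index time
            }
        }
    }


mapping = doc_values_mapping("price", "long")  # "price" is an assumed field name
print(json.dumps(mapping, indent=2))
```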