elasticsearchhivequery-optimizationelasticsearch-hadoop

Data load from HDFS to ES taking very long time


I have created an external table in hive and need to move the data to ES (of 2 nodes, each with 1 TB). Below regular query taking very long time (more than 6 hours) for a source table with 9GB of data.

INSERT INTO TABLE <ES_DB>.<EXTERNAL_TABLE_FOR_ES> 
SELECT COL1, COL2, COL3..., COL10 
  FROM <HIVE_DB>.<HIVE_TABLE>;

ES index is having default 5 shards and 1 replica. Increasing the number of shards could any way speed up the ingestion? Could some one suggest any improvements to speed up the ES node ingestion.


Solution

  • You don't mention the methodology you're using to feed the data into ES so it's hard to see if you're using an ingestion pipeline or what technology to bridge the gap. Given that, I'll stick with generic advice on how to optimize ingestion into Elasticsearch.

    Elastic has published some guidance for optimizing systems for ingestion, and there are three points that we've found do make a real difference:

    1. Turn Off Replicas: Set the number of replicas to zero while injesting the data to eliminate the need to copy the data while also injesting it. This is an index-level setting ("number_of_replicas")
    2. Don't Specify an ID: It isn't clear from your database schema if you are mapping across any identifiers, but if you can avoid specifying a document Id to Elastic and let it specify its own that significantly improves performance.
    3. Use Parallel Bulk Operators: Use the BulkAPI to push data into ES and feed it with multiple threads so it always has more than one Bulk request to work on server-side.

    Finally, have you installed Kibana and monitored your nodes to know what they are limited by? In particular CPU or Memory?