vora

How to Improve Vora Performance


I have been running some tests against both Vora and Hive, through the SAP Spark Controller as well as a plain Spark Thrift Server. The Controller and the Spark Thrift Server have the same configuration.

Test table:
12 columns
10M rows
680 MB

Both the Spark Thrift Server and the SAP Controller are started with --master YARN and the same number of executors, executor memory, and cores. The Controller and the Thrift Server run on the same server in the Hadoop cluster. I run one test, shut down that Controller/Thrift Server, then start up the other one and test again.

All numbers below are the job completion times reported by the Thrift Server or the SAP Controller. I am not waiting for the results to show up in HANA, Beeline, or Spark-Shell.

Results:

Spark-Shell -> Spark Thriftserver -> Hive
Select Column returns in : 13s
Count returns in : 1.2s

Spark-Shell -> Spark Thriftserver -> Vora
Select Column returns in : 5s
Count returns in : 100ms

HANA -> SAP Controller -> Hive
Select Column returns in : 45s
Count returns in : 4s

HANA -> SAP Controller -> Vora
Select Column returns in : 24s
Count returns in : 2.1s

Beeline -> Spark Thriftserver -> Hive
Select Column returns in : 35s
Count returns in : 1.9s

Beeline -> Spark Thriftserver -> Vora
Select Column returns in : 55s
Count returns in : 1.2s
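
To make the comparison easier to read, the timings above can be turned into Vora/Hive ratios per client path. This is only a small arithmetic sketch in Python: the numbers are copied from the results above, and the dictionary layout and the helper name `vora_to_hive_ratio` are my own, not part of any SAP or Spark tooling. A ratio below 1 means Vora finished faster than Hive on that path.

```python
# Job completion times in seconds, copied from the results above.
timings = {
    "Spark-Shell -> Spark Thriftserver": {
        "Hive": {"select": 13.0, "count": 1.2},
        "Vora": {"select": 5.0, "count": 0.1},
    },
    "HANA -> SAP Controller": {
        "Hive": {"select": 45.0, "count": 4.0},
        "Vora": {"select": 24.0, "count": 2.1},
    },
    "Beeline -> Spark Thriftserver": {
        "Hive": {"select": 35.0, "count": 1.9},
        "Vora": {"select": 55.0, "count": 1.2},
    },
}


def vora_to_hive_ratio(path, query):
    """Return Vora time divided by Hive time for one client path and query."""
    entry = timings[path]
    return entry["Vora"][query] / entry["Hive"][query]


# Print one line per client path and query type.
for path in timings:
    for query in ("select", "count"):
        ratio = vora_to_hive_ratio(path, query)
        print(f"{path:36s} {query:6s} Vora/Hive = {ratio:.2f}")
```

Laid out this way, the Beeline select is the only case where Vora is slower than Hive, and every query through the Controller is slower than the same query issued from Spark-Shell.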

Are there any important performance tuning tips to help the Controller? It is interesting that I can select from Hive faster than the Controller can select from Vora.


Solution

  • After some partitioning changes, I have gotten the SAP Controller to select the data from Hive at a faster rate; Vora is still about the same speed. It seems that a smaller number of splits helps the Controller tremendously: reducing the data from 31 files to 10 decreases the query time by more than 75% (the HANA select through the Controller went from 45s to 10s).

    Current results:

    Spark-Shell -> Spark Thriftserver -> Hive
    Select Column returns in : 14s
    Count returns in : 1s

    HANA -> SAP Controller -> Hive
    Select Column returns in : 10s
    Count returns in : 5s

    Beeline -> Spark Thriftserver -> Hive
    Select Column returns in : 7s
    Count returns in : 1.3s

    The count still seems to return slowly, but that is not a problem.
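
The answer does not say how the data was brought from 31 files down to 10. One common way to do it in Spark is to rewrite the table with fewer partitions, so the following is a minimal PySpark sketch under that assumption, not necessarily what was done here. The function name `compact_table` and the table names are hypothetical. `spark` is expected to be an existing SparkSession with Hive support.

```python
def compact_table(spark, source_table, target_table, num_files=10):
    """Rewrite source_table into target_table with at most num_files files.

    `spark` is an existing pyspark.sql.SparkSession. Both table names are
    hypothetical placeholders chosen by the caller.
    """
    if num_files < 1:
        raise ValueError("num_files must be at least 1")

    df = spark.table(source_table)

    # coalesce() merges existing partitions without a full shuffle. It can
    # only lower the partition count, never raise it.
    compacted = df.coalesce(num_files)

    # Each non-empty partition is written out as one file. Writing to a
    # separate table avoids overwriting the table that is being read.
    compacted.write.mode("overwrite").saveAsTable(target_table)
```

From a pyspark shell this would be called as something like `compact_table(spark, "test_table_31", "test_table_10", 10)`, again with made-up table names. If the resulting files come out very uneven in size, `repartition(num_files)` is the alternative, at the cost of a full shuffle.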