apache-drillcache-locality

Apache Drill database and data locality


I have two servers. The first server (A) contains the zookeeper, a mongodb database and a drillbit. The second server (B) contains a hadoop distribution with several hive tables, a postgresql database and the other drillbit. Both drillbits can see eachother on the drill homepages, since they are both connected to the zookeeper on server A. When a query like the one below is run (the view dfs.lineorder consists of the hive and postgresql data on server B), drillbit B becomes the foreman and executes everything. It does not delegate the parts dealing with mongodb tables to the drillbit on server A, so 90% of the time required to run the query (approx. 30 minutes) is spent on sending the mongodb collection from server A to server B. Is there a way to force drillbit B to delegate the mongodb part of the query to drillbit B, or could this be a misconfiguration problem? Also, can drillbits be configured to access different databases (e.g. when each drillbit has access to a distinct subnet, so that not all drillbits can access the same databases)?

Sample query:

SELECT SUM(revenue) AS revenue
FROM (
    SELECT SUM(lo_extendedprice*lo_discount) AS revenue
    FROM dfs.tmp.lineorder, dfs.tmp.`date`
    WHERE lo_orderdate = d_datekey
    AND d_year = 1993
    AND lo_discount BETWEEN 1 AND 3
    AND lo_quantity < 25
UNION ALL
    SELECT SUM(lo_extendedprice * lo_discount) AS revenue
    FROM mongo.test.ssb_europe ssb
    WHERE ssb.orderdate.d_year = 1993
    AND lo_discount BETWEEN 1 AND 3
    AND lo_quantity < 25
);

Configuration (drill-override.conf):

On drillbit A (Windows Server 2008): drill.exec: { cluster-id: "drillbits1", zk.connect: "serverA:2181", impersonation: { enabled: true, max_chained_user_hops: 3 } }

On drillbit B (Cloudera CDH 5.8.0): drill.exec: { cluster-id: "drillbits1", zk.connect: "serverA:2181", impersonation: { enabled: true, max_chained_user_hops: 3 } }


Solution

  • It appears that the MongoDB storage plugin could be the problem. If the query is clearly separated into two complete subqueries (with grouping), the optimizer splits the work successfuly. The following query splits the work between the two drillbits correctly.

    SELECT SUM(lo_revenue), d_year, p_brand1
    FROM (
        SELECT SUM(lo_revenue) AS lo_revenue, d_year, p_brand1
        FROM dfs.tmp.lineorder, dfs.tmp.`date`, dfs.tmp.part, dfs.tmp.supplier
        WHERE lo_orderdate = d_datekey
        AND lo_partkey = p_partkey
        AND lo_suppkey = s_suppkey
        AND p_category = 'MFGR#12'
        AND s_region = 'AMERICA'
        GROUP BY d_year, p_brand1
    UNION ALL
        SELECT SUM(TO_NUMBER(lo_revenue, '########.##')) AS lo_revenue, ssb.orderdate.d_year AS d_year, ssb.part.p_brand1 AS p_brand1
        FROM mongo.test.ssb_europe ssb
        WHERE IsNumeric(lo_revenue)
        AND ssb.part.p_category = 'MFGR#12'
        AND ssb.supplier.s_region = 'AMERICA'
        GROUP BY ssb.orderdate.d_year, ssb.part.p_brand1
    ) l
    GROUP BY d_year, p_brand1
    ORDER BY d_year, p_brand1;