I have two servers. The first server (A) contains the zookeeper, a mongodb database and a drillbit. The second server (B) contains a hadoop distribution with several hive tables, a postgresql database and the other drillbit. Both drillbits can see eachother on the drill homepages, since they are both connected to the zookeeper on server A. When a query like the one below is run (the view dfs.lineorder consists of the hive and postgresql data on server B), drillbit B becomes the foreman and executes everything. It does not delegate the parts dealing with mongodb tables to the drillbit on server A, so 90% of the time required to run the query (approx. 30 minutes) is spent on sending the mongodb collection from server A to server B. Is there a way to force drillbit B to delegate the mongodb part of the query to drillbit B, or could this be a misconfiguration problem? Also, can drillbits be configured to access different databases (e.g. when each drillbit has access to a distinct subnet, so that not all drillbits can access the same databases)?
Sample query:
SELECT SUM(revenue) AS revenue
FROM (
SELECT SUM(lo_extendedprice*lo_discount) AS revenue
FROM dfs.tmp.lineorder, dfs.tmp.`date`
WHERE lo_orderdate = d_datekey
AND d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
UNION ALL
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM mongo.test.ssb_europe ssb
WHERE ssb.orderdate.d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
);
Configuration (drill-override.conf):
On drillbit A (Windows Server 2008): drill.exec: { cluster-id: "drillbits1", zk.connect: "serverA:2181", impersonation: { enabled: true, max_chained_user_hops: 3 } }
On drillbit B (Cloudera CDH 5.8.0): drill.exec: { cluster-id: "drillbits1", zk.connect: "serverA:2181", impersonation: { enabled: true, max_chained_user_hops: 3 } }
It appears that the MongoDB storage plugin could be the problem. If the query is clearly separated into two complete subqueries (with grouping), the optimizer splits the work successfuly. The following query splits the work between the two drillbits correctly.
SELECT SUM(lo_revenue), d_year, p_brand1
FROM (
SELECT SUM(lo_revenue) AS lo_revenue, d_year, p_brand1
FROM dfs.tmp.lineorder, dfs.tmp.`date`, dfs.tmp.part, dfs.tmp.supplier
WHERE lo_orderdate = d_datekey
AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey
AND p_category = 'MFGR#12'
AND s_region = 'AMERICA'
GROUP BY d_year, p_brand1
UNION ALL
SELECT SUM(TO_NUMBER(lo_revenue, '########.##')) AS lo_revenue, ssb.orderdate.d_year AS d_year, ssb.part.p_brand1 AS p_brand1
FROM mongo.test.ssb_europe ssb
WHERE IsNumeric(lo_revenue)
AND ssb.part.p_category = 'MFGR#12'
AND ssb.supplier.s_region = 'AMERICA'
GROUP BY ssb.orderdate.d_year, ssb.part.p_brand1
) l
GROUP BY d_year, p_brand1
ORDER BY d_year, p_brand1;