sqlapache-drill

Apache Drill: Query fails with Table but works with View


I am currently trying to do some TPC-H benchmarks using Apache Drill. I've noticed some weird behaviour depending on whether I use Tables or Views.

If I define my data using view as follows:

create view if not exists customer as select
    cast(columns[0] as bigint) c_custkey,
    cast(columns[1] as char(25)) c_name,
    cast(columns[2] as varchar(40)) c_address,
    cast(columns[3] as int) c_nationkey,
    cast(columns[4] as char(15)) c_phone,
    cast(columns[5] as double) c_acctbal,
    cast(columns[6] as char(10)) c_mktsegment,
    cast(columns[7] as varchar(101)) c_comment
from dfs.`/opt/tpch/data/customer.tbl`;

create view if not exists orders as select
    cast(columns[0] as bigint) o_orderkey,
    cast(columns[1] as bigint) o_custkey,
    cast(columns[2] as char(1)) o_orderstatus,
    cast(columns[3] as double) o_totalprice,
    cast(columns[4] as date) o_orderdate,
    cast(columns[5] as char(15)) o_orderpriority,
    cast(columns[6] as char(15)) o_clerk,
    cast(columns[7] as int) o_shippriority,
    cast(columns[8] as varchar(79)) o_comment
from dfs.`/opt/tpch/data/orders.tbl`;

create view if not exists lineitem as select
    cast(columns[0] as bigint) l_orderkey,
    cast(columns[1] as bigint) l_partkey,
    cast(columns[2] as bigint) l_suppkey,
    cast(columns[3] as int) l_linenumber,
    cast(columns[4] as double) l_quantity,
    cast(columns[5] as double) l_extendedprice,
    cast(columns[6] as double) l_discount,
    cast(columns[7] as double) l_tax,
    cast(columns[8] as char(1)) l_returnflag,
    cast(columns[9] as char(1)) l_linestatus,
    cast(columns[10] as date) l_shipdate,
    cast(columns[11] as date) l_commitdate,
    cast(columns[12] as date) l_receiptdate,
    cast(columns[13] as char(25)) l_shipinstruct,
    cast(columns[14] as char(10)) l_shipmode,
    cast(columns[15] as varchar(44)) l_comment
from dfs.`/opt/tpch/data/lineitem.tbl`;

And then for example execute the following query (tpc-h, query 3) using the $DRILL_HOME/bin/drill-conf -u "jdbc:drill:zk=node1;schema=dfs.tmp" -log "result.txt" -f "q3.sql":

select 
    l_orderkey, 
    sum(l_extendedprice * (1 - l_discount)) as revenue, 
    o_orderdate, 
    o_shippriority 
from 
    customer, 
    orders, 
    lineitem 
where 
    c_mktsegment = 'BUILDING' 
    and c_custkey = o_custkey 
    and l_orderkey = o_orderkey 
    and o_orderdate < date '1995-03-15' 
    and l_shipdate > date '1995-03-15' 
group by 
    l_orderkey, 
    o_orderdate, 
    o_shippriority 
order by 
    revenue desc, 
    o_orderdate 
LIMIT 10;
I get the correct expected result.

However, if I now use table instead of view for my data definitions (i.e., everything the same as above just replace view with table), and execute the exact same query as before, I now get an error and it does not return any result:

Error: VALIDATION ERROR: From line 11, column 2 to line 11, column 13: Column 'c_mktsegment' is ambiguous


[Error Id: c929bf25-df6b-40c0-9866-b606f769c410 ] (state=,code=0)
Aborting command set because "force" is false and command failed: "select 
    l_orderkey, 
    sum(l_extendedprice * (1 - l_discount)) as revenue, 
    o_orderdate, 
    o_shippriority 
from 
    customer, 
    orders, 
    lineitem 
where 
    c_mktsegment = 'BUILDING' 
    and c_custkey = o_custkey 
    and l_orderkey = o_orderkey 
    and o_orderdate < date '1995-03-15' 
    and l_shipdate > date '1995-03-15' 
group by 
    l_orderkey, 
    o_orderdate, 
    o_shippriority 
order by 
    revenue desc, 
    o_orderdate 
LIMIT 10;"

I know that I could just rewrite all my queries and add the table names in front of each column name, but that is cumbersome and should not be necessary.

Anyone have any idea what could be causing this issue? And how to fix it?
Thanks for any help.

System:


Solution

  • The short version of this is that Drill was designed to be used without requiring a pre-defined schema. If you understand the ins and outs of this, it works pretty well, however, there are situations like this where it might not behave as you'd expect.

    Basically what is happening here is that Drill doesn't have the schema information avaialble during query planning and as a result doesn't know what table the fields are in. (Note that if you are querying a database or something that provides metadata, this doesn't happen). Anyway, you have a few options:

    1. Rewrite the query with table names.
    2. Use Drill's Metastore after creating the files. If you use Drill's metastore, the schema will be available during planning and you will get better performance and you likely won't have to specify table names. (https://drill.apache.org/docs/using-drill-metastore/)

    The reason your query works for views is that Drill views actually do store schema information, whereas files do not.