I am currently trying to do some TPC-H benchmarks using Apache Drill. I've noticed some weird behaviour depending on whether I use Tables or Views.
If I define my data using view as follows:
create view if not exists customer as select
cast(columns[0] as bigint) c_custkey,
cast(columns[1] as char(25)) c_name,
cast(columns[2] as varchar(40)) c_address,
cast(columns[3] as int) c_nationkey,
cast(columns[4] as char(15)) c_phone,
cast(columns[5] as double) c_acctbal,
cast(columns[6] as char(10)) c_mktsegment,
cast(columns[7] as varchar(101)) c_comment
from dfs.`/opt/tpch/data/customer.tbl`;
create view if not exists orders as select
cast(columns[0] as bigint) o_orderkey,
cast(columns[1] as bigint) o_custkey,
cast(columns[2] as char(1)) o_orderstatus,
cast(columns[3] as double) o_totalprice,
cast(columns[4] as date) o_orderdate,
cast(columns[5] as char(15)) o_orderpriority,
cast(columns[6] as char(15)) o_clerk,
cast(columns[7] as int) o_shippriority,
cast(columns[8] as varchar(79)) o_comment
from dfs.`/opt/tpch/data/orders.tbl`;
create view if not exists lineitem as select
cast(columns[0] as bigint) l_orderkey,
cast(columns[1] as bigint) l_partkey,
cast(columns[2] as bigint) l_suppkey,
cast(columns[3] as int) l_linenumber,
cast(columns[4] as double) l_quantity,
cast(columns[5] as double) l_extendedprice,
cast(columns[6] as double) l_discount,
cast(columns[7] as double) l_tax,
cast(columns[8] as char(1)) l_returnflag,
cast(columns[9] as char(1)) l_linestatus,
cast(columns[10] as date) l_shipdate,
cast(columns[11] as date) l_commitdate,
cast(columns[12] as date) l_receiptdate,
cast(columns[13] as char(25)) l_shipinstruct,
cast(columns[14] as char(10)) l_shipmode,
cast(columns[15] as varchar(44)) l_comment
from dfs.`/opt/tpch/data/lineitem.tbl`;
And then for example execute the following query (tpc-h, query 3) using the $DRILL_HOME/bin/drill-conf -u "jdbc:drill:zk=node1;schema=dfs.tmp" -log "result.txt" -f "q3.sql"
:
select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date '1995-03-15'
and l_shipdate > date '1995-03-15'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
LIMIT 10;
I get the correct expected result.
However, if I now use table instead of view for my data definitions (i.e., everything the same as above just replace view
with table
), and execute the exact same query as before, I now get an error and it does not return any result:
Error: VALIDATION ERROR: From line 11, column 2 to line 11, column 13: Column 'c_mktsegment' is ambiguous
[Error Id: c929bf25-df6b-40c0-9866-b606f769c410 ] (state=,code=0)
Aborting command set because "force" is false and command failed: "select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < date '1995-03-15'
and l_shipdate > date '1995-03-15'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
LIMIT 10;"
I know that I could just rewrite all my queries and add the table names in front of each column name, but that is cumbersome and should not be necessary.
Anyone have any idea what could be causing this issue? And how to fix it?
Thanks for any help.
System:
The short version of this is that Drill was designed to be used without requiring a pre-defined schema. If you understand the ins and outs of this, it works pretty well, however, there are situations like this where it might not behave as you'd expect.
Basically what is happening here is that Drill doesn't have the schema information avaialble during query planning and as a result doesn't know what table the fields are in. (Note that if you are querying a database or something that provides metadata, this doesn't happen). Anyway, you have a few options:
The reason your query works for views is that Drill views actually do store schema information, whereas files do not.