We have run a polybase service to query on our hadoop cluster. Here is the file structure in our hadoop side:
Structure for /warehouse/tablespace/managed/hive/xxxxxxx/dxxxxxx/created_month=2019-11/base_0000083/bucket_00001 File Version: 0.12 with ORC_135 Rows: 51107 Compression: ZLIB Compression size: 262144 id:int, xxx xxx xxx
There are 16 fields in this side
On the other side, I have created a file format as follows:
CREATE EXTERNAL FILE FORMAT [OrcFileFormat] WITH (
FORMAT_TYPE = ORC,
DATA_COMPRESSION = N'org.apache.hadoop.io.compress.DefaultCodec')
and the external table as below:
CREATE EXTERNAL TABLE [dbo].[test]
(
id:int,
xxx
xxx
xxx
)
WITH (DATA_SOURCE = [hadoop_test],
LOCATION = N'//warehouse/tablespace/managed/hive/xxxxxxx/dxxxxxx/created_month=2019-
11/base_0000083/bucket_00001'
,FILE_FORMAT = [OrcFileFormat]
,REJECT_TYPE = VALUE,
REJECT_VALUE = 500)
With exactly copied fields and the only change to data types were string to varchar and timestamp to datetime. The table is created successfully; however, when following query is executed:
SELECT TOP (1000) *
FROM [dbo].test
This error is thrown:
Cannot execute the query "Remote Query" against OLE DB provider "MSOLEDBSQL" for linked server "(null)". External file access failed due to internal error: 'File //warehouse/tablespace/managed/hive/xxxxxxx/dxxxxxx/created_month=2019-11/base_0000083/bucket_00001: HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader: HadoopExecutionException: Column count mismatch. Source file has 6 columns, external table definition has 16 columns.'
Can someone please help me out?
It occurs when ORC is created as a transaction table. In this situation, ORC files contain some other information like operation, etc. If you open your ORC file, it'll look like as the following format:
(operation, originalTransaction, bucket, rowId, currentTransaction, (ActualData))
where ActualData is your actual data that have been stored in the file. So, as you can see, it seems you have 6 columns against your actual columns count. As Microsoft has said about Polybase limitations here, you will not be able to read ORC transaction files.