sqlhadoophivepolybase

PolyBase create an external table to query Hadoop, column count mismatch


We have run a polybase service to query on our hadoop cluster. Here is the file structure in our hadoop side:

Structure for /warehouse/tablespace/managed/hive/xxxxxxx/dxxxxxx/created_month=2019-11/base_0000083/bucket_00001 File Version: 0.12 with ORC_135 Rows: 51107 Compression: ZLIB Compression size: 262144 id:int, xxx xxx xxx

There are 16 fields in this side

On the other side, I have created a file format as follows:

    CREATE EXTERNAL FILE FORMAT [OrcFileFormat] WITH (
    FORMAT_TYPE = ORC,
     DATA_COMPRESSION = N'org.apache.hadoop.io.compress.DefaultCodec')
   

and the external table as below:

    CREATE EXTERNAL TABLE [dbo].[test]
    (
        id:int,
    xxx
    xxx
    xxx
    )
    WITH (DATA_SOURCE = [hadoop_test],
    LOCATION = N'//warehouse/tablespace/managed/hive/xxxxxxx/dxxxxxx/created_month=2019- 
     11/base_0000083/bucket_00001'
    ,FILE_FORMAT = [OrcFileFormat]
    ,REJECT_TYPE = VALUE,
    REJECT_VALUE = 500)
   

With exactly copied fields and the only change to data types were string to varchar and timestamp to datetime. The table is created successfully; however, when following query is executed:

     SELECT TOP (1000) *
      FROM [dbo].test

This error is thrown:

Cannot execute the query "Remote Query" against OLE DB provider "MSOLEDBSQL" for linked server "(null)". External file access failed due to internal error: 'File //warehouse/tablespace/managed/hive/xxxxxxx/dxxxxxx/created_month=2019-11/base_0000083/bucket_00001: HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader: HadoopExecutionException: Column count mismatch. Source file has 6 columns, external table definition has 16 columns.'

Can someone please help me out?


Solution

  • It occurs when ORC is created as a transaction table. In this situation, ORC files contain some other information like operation, etc. If you open your ORC file, it'll look like as the following format:

    (operation, originalTransaction, bucket, rowId, currentTransaction, (ActualData))

    where ActualData is your actual data that have been stored in the file. So, as you can see, it seems you have 6 columns against your actual columns count. As Microsoft has said about Polybase limitations here, you will not be able to read ORC transaction files.