amazon-redshift-spectrum

Redshift spectrum shows NULL values for all rows


When I run this query in Athena query editor, it works as expected.

SELECT * FROM "sampledb"."elb_logs" limit 10;

elb_logs table has been generated based on the official tutorial. When I try to use spectrum in redshift, I can see all "NULL" values for all columns. I am creating athena_schema using these commands:

drop schema "athena_schema";

create external schema athena_schema from data catalog 
database 'sampledb' 
iam_role 'arn:aws:iam::94331XXXXXXX:role/RedshiftCopyUnload'
region 'ap-south-1';

And the output of system table:

select * from svv_external_tables;

schemaname  tablename   location    input_format    output_format   serialization_lib   serde_parameters    compressed  parameters

athena_schema   elb_logs    s3://athena-examples-ap-south-1/elb/plaintext   org.apache.hadoop.mapred.TextInputFormat    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  org.apache.hadoop.hive.serde2.RegexSerDe    {"input.regex":"([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*)     0   {"EXTERNAL":"TRUE","transient_lastDdlTime":"1480278335"}

I am not sure why the athena console shows the correct values for all columns while redshift shows all NULLs?


Solution

  • This was because elb_logs table was using regular expression serialization that spectrum can not handle. I converted the table into parquet file format using this command.

    CREATE TABLE elb_logs3
    WITH (
          format = 'PARQUET',
          parquet_compression = 'SNAPPY',
          external_location = 's3://elb163/parqfiles'
    ) AS SELECT * from elb_logs
    

    Now athena will have 2 tables "elb_logs" and "elb_logs3". Once I create the external schema using standard commands like this...

    drop schema "athena_schema";
    
    create external schema athena_schema from data catalog 
    database 'sampledb' 
    iam_role 'arn:aws:iam::XXX:role/RedshiftCopyUnload'
    region 'us-east-1';
    

    I can now select records from elb_logs table like this...

    select * from athena_schema.elb_logs3 limit 10;

    Note that selecting from elb_logs table still shows NULL values for all columns.