Tags: hadoop, impala, utl-file

HPL/SQL UTL_FILE.PUT_LINE Puts NULL Character Between Each Character


I'm writing text directly to a file on HDFS using HPL/SQL's UTL_FILE package function PUT_LINE(). Each line in the file consists of several text fields delimited by semicolons.
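
For reference, the write pattern looks roughly like this (the HDFS directory, file name, and field values here are placeholders):

    DECLARE
      f UTL_FILE.FILE_TYPE;
    BEGIN
      f := UTL_FILE.FOPEN('/user/me/output', 'data.txt', 'w');  -- HDFS directory (placeholder)
      UTL_FILE.PUT_LINE(f, '2024-01-15;field2;field3');         -- one semicolon-delimited record
      UTL_FILE.FCLOSE(f);
    END;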

Note:

  1. When I copy the file to the local file system with hadoop fs -copyToLocal and open it in vi, I see a NULL (^@) character between every pair of output characters.
  2. file -i displays "application/octet-stream; charset=binary"
  3. Linux env returns LANG=en_US.UTF-8
  4. An external table created over the file with CREATE EXTERNAL TABLE in Impala SQL displays the data correctly (see the DDL sketch after this list), but...
  5. One column, DATE_ID, is a string in YYYY-MM-DD format. When I SELECT LENGTH(DATE_ID), I get back a length of 21, not the expected 10 (both before and after TRIM()).
  6. Specifying tblproperties('serialization.encoding'='UTF-8') or tblproperties('serialization.encoding'='UTF-16') makes no difference; I still get back a length of 21.
  7. The other columns, not just DATE_ID, have the same issue.
  8. Replacing FIELDS TERMINATED BY ';' with FIELDS TERMINATED BY '\u003B' (the Unicode escape for a semicolon) makes no difference.
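
For context, the table definition and the length check look roughly like this (the table name, column list, and location are placeholders):

    CREATE EXTERNAL TABLE my_table (
      date_id STRING,
      field2  STRING,
      field3  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE
    LOCATION '/user/me/output'
    TBLPROPERTIES ('serialization.encoding'='UTF-8');

    SELECT LENGTH(date_id) FROM my_table;   -- returns 21, not the expected 10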

Any ideas on how to read the external data while it's in this state, or how to prevent PUT_LINE() from writing it out this way in the first place?


Solution

  • I haven't found a way to make the serialization.encoding option fix the issue, but here are two workarounds. (The interleaved NULs, and the length of 21 for a 10-character field, appear consistent with the file being written as UTF-16: each ASCII character becomes a character byte plus a 0x00 byte, and splitting on the single-byte ';' delimiter leaves one extra NUL attached to each field.)

    1. Use PRINT() or DBMS_OUTPUT.PUT_LINE() to write out a text string to the Linux file system and then push it over to HDFS.
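
       A minimal sketch (the hplsql CLI invocation, file names, and paths are assumptions): print each delimited record to stdout, redirect it to a local file, then push that file to HDFS:

      -- write_rows.sql: emit each delimited record to stdout
      PRINT '2024-01-15;field2;field3';

      -- then, from the shell:
      --   hplsql -f write_rows.sql > /tmp/data.txt
      --   hadoop fs -put /tmp/data.txt /user/me/output/data.txt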

    2. Use REGEXP_REPLACE to remove the null characters in each column:

      regexp_replace(column_name, '\x00', '')
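
       For example, to confirm the cleanup on the DATE_ID column (table name is a placeholder):

      SELECT LENGTH(regexp_replace(DATE_ID, '\x00', '')) FROM my_table;
      -- returns 10 instead of 21

       Note: depending on where the query is entered, the backslash may itself need escaping (e.g. '\\x00' when submitted through impala-shell).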