Tags: amazon-web-services, amazon-redshift, zalgo

How to handle zalgo text when loading into Redshift


I am getting an error:

String length exceeds DDL length                                                                    

when loading from a CSV into a column in Redshift. I'm copying data from a source (which sometimes has zalgo text) into a CSV in S3, and loading the CSV into Redshift.

This is my code when extracting data from the source into S3:

substring(s.caption for 2000) as caption,
substring(s.location for 2000) as location,

and I use LOAD to move the data from S3 to Redshift, which is when I get the error. How do I crop the zalgo text properly? Is there a way? What's the root of the issue... why doesn't substring work?


Solution

  • Common issue. Redshift uses multi-byte UTF-8 encoding for non-ASCII characters in varchar data, and the varchar length definition is in bytes, not characters. When the data is all ASCII there is no difference, so the distinction doesn't matter. But when non-ASCII characters are used, the character length and the byte length of a string diverge.

    Redshift has the functions len() and octet_length(), which return the character length and the byte length of a string, respectively - see: https://docs.aws.amazon.com/redshift/latest/dg/r_OCTET_LENGTH.html
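
    For example (the string literal here is just a sample value; any UTF-8 string shows the same effect):

    -- len() counts characters, octet_length() counts bytes
    select len('héllo')          as char_len,  -- 5
           octet_length('héllo') as byte_len;  -- 6, since 'é' takes 2 bytes in UTF-8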

    So there are a few ways you can fix this issue:

    1. Change the code producing the S3 file to limit the strings to 200 bytes. This may be easy or hard depending on what tool you are using.
    2. Change the table definition to leave enough room for all the multi-byte characters you are likely to see. Since a multi-byte UTF-8 character cannot be more than 4 bytes, multiplying the length by 4 covers the worst case. But carrying along columns that have been made 4X larger can have a performance cost if multi-byte characters are rare. Also, any given value may contain only a few multi-byte characters, so 4X is usually overkill.
    3. Ingest into a staging table whose column is 4X the size, then append the data to your main table while limiting the octet_length() to 200. You can do this by subtracting the difference between octet_length() and len() inside a substring() call - a sketch of the staging setup follows this list.
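
    For illustration, the staging side of option 3 might look like this (the bucket, IAM role, and table names are placeholders, not from the question):

    -- staging table: 4X the byte budget of the 200-byte target column
    create table stage (txt varchar(800));

    -- load the raw file into the wide staging column
    copy stage
    from 's3://my-bucket/extract.csv'
    iam_role 'arn:aws:iam::111122223333:role/my-redshift-role'
    csv;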

    =================================================================

    UPDATE:

    There has been a lot of discussion in the comments, so it should be summarized and clarified here.

    Redshift encodes varchar text in multi-byte UTF-8, as does much of the internet. This format is a variable-length encoding that can be as small as 1 byte per character and as large as 4 bytes per character, depending on the character. Basic ASCII characters are 1 byte; other characters (and symbols) take more than 1 byte to store. See https://en.wikipedia.org/wiki/UTF-8 for a detailed description of UTF-8 and how non-ASCII characters are encoded. See https://design215.com/toolbox/ascii-utf8.php for a table of all the 1- and 2-byte UTF-8 characters.
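
    For instance, a quick query shows the 1-, 2-, 3-, and 4-byte cases (the literals are arbitrary sample characters):

    select octet_length('a')  as ascii_bytes,   -- 1 byte
           octet_length('é')  as accent_bytes,  -- 2 bytes
           octet_length('€')  as symbol_bytes,  -- 3 bytes
           octet_length('𝄞')  as clef_bytes;    -- 4 bytes (outside the Basic Multilingual Plane)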

    Also, from Wikipedia:

    UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 97.9% of all web pages, over 99.0% of the top 10,000 pages, and up to 100% for many languages, as of 2023. Virtually all countries and languages have 95% or more use of UTF-8 encodings on the web.

    So any string of 200 characters can have a "byte size" anywhere between 200 bytes (all ASCII characters) and 800 bytes (all characters that require 4 bytes to encode). I've never worked with zalgo text, so I don't know its exact UTF-8 footprint, but it is typically built by stacking Unicode combining marks (2 bytes each in UTF-8) onto ordinary base characters, so its byte count can be several times its character count. Typically most strings have only a few characters that require multiple bytes, but that isn't always the case, especially with some of the rarer Eastern character sets. Your situation may be unique.

    Varchar lengths in Redshift are in bytes, not characters. This distinction is not important when the text is all ASCII, but having multi-byte characters in a string can cause unexpected errors. This seems to be what you are running into.

    All string functions in Redshift operate on characters, except octet_length(), which returns a string's length in bytes. If you have a fixed varchar column size and you need to fit a string that is too long (in terms of bytes), you will need to trim the string down to fit. You will also want to preserve as much of the original string as possible (or at least I would). The challenge is to trim the minimum number of characters from your string, using the SQL functions available, such that the resulting string will fit in varchar(200) (200 bytes).

    Let's assume that your S3 file has data that is 200 characters or less, and that your target table has a varchar(200) column for this data to be inserted into. We can start by COPYing the S3 data into a staging table, "stage", that has a column, "txt", defined as varchar(800) - 200 characters X 4 bytes per character, worst case.

    Now, a common way to handle the case where only a small subset of characters is more than 1 byte is to trim the character count by however much octet_length() exceeds the desired length of 200. That looks like:

    select substring(txt, 1, 200 - (octet_length(txt) - len(txt))) as txt
    from stage;
    

    The problem with this is that when the string has many multi-byte characters, zero-length strings are produced. For example, a 200-character string made entirely of 2-byte characters has len() 200 and octet_length() 400, so the computed length is 200 - (400 - 200) = 0. A more sophisticated approach is needed.

    To complicate things, we don't know where in the string the multi-byte characters are positioned: at the beginning, at the end, or spread throughout. But we don't want the SQL to fail with errors no matter what string we feed it.

    One approach would be a proportional one: if the string is 2X the desired byte size, cut it in half - that is, keep len(txt) * 200 / octet_length(txt) characters. This will get the resulting string close to the desired byte length, but since the multi-byte characters could all be at the beginning, the result may still be too large. A second step will be needed to fix up this corner case.

    An approach like this might work (off the cuff):

    with cte1 as (
    select txt,
      -- FYI DECODE() is a simple version of CASE: keep every character if the
      -- string already fits in 200 bytes, otherwise keep a proportional share
      -- of the characters (len * 200 / octet_length)
      decode(octet_length(txt) < 200,
             true, len(txt),
             (len(txt) * (200::decimal(8,4) / octet_length(txt)))::int) as last_char
    from stage ),
    cte2 as (
    select substring(txt, 1, last_char) as txt
    from cte1)
    -- final trim: drop one character per byte still over the 200-byte budget
    select substring(txt, 1, 200 - (octet_length(txt) - len(txt))) as txt
    from cte2;
    

    (that code is untested so sorry for any typos / bugs)

    The first CTE computes how many characters of the string will approximately fit in 200 bytes. The second CTE trims the string to that length. The top select does a final character trim to ensure the result fits in 200 bytes.
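
    Once the trimmed data has been appended, a sanity check along these lines ("main_table" is a hypothetical target name) confirms nothing slipped past the byte budget:

    -- should return 0 rows if the trimming worked
    select txt, len(txt), octet_length(txt)
    from main_table
    where octet_length(txt) > 200;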

    There may be others with better answers for how to keep the max number of characters while not going over 200 bytes. This could be its own separate question on SO.