amazon-web-services amazon-s3 amazon-redshift

Copy Data From S3 to Redshift [Precision issue in numeric data]

I am copying data from a text file into a Redshift table using the following command:

COPY redshift_table_name FROM 's3://gamma-audit-calculation-output-ngr-data-json/2021/05/10/08/kinesis-calculation-output-ngr-data-1-2021-05-10-09-48-24-82ecea90-ef50-4907-82d7-8b162ca2b841' CREDENTIALS iam_role json 'auto';

Attaching the file present in the path specified.

The data present in file is:-

{"totalgross":6113.47,"totalnetpay":3661.6,"calculationtime":"05/10/2021 02:48:24 AM PDT","dynamicngrlaunched":true,"employeeid":"881448","totalanytimepaywithdrawals":6.62,"totalimputedincome":12.1,"paycheckdate":"2021-04-30","calculationtimeepochmillis":"1620640104258","ngr":0.60,"totalanytimepayrepayments":0.0,"otherrepayments":0.0,"payenddate":"2021-04-30","employeeid_calculationtimeepochmillis":"881448_1620640104258"}

The schema for my redshift table is:-

create table table_name ( employeeid varchar(65535), ngr numeric(17, 2), totalgross numeric(17, 2), totalnetpay numeric(17, 2), earningamount numeric(17, 2), totalimputedincome numeric(17, 2), totalanytimepaywithdrawals numeric(17, 2), totalanytimepayrepayments numeric(17, 2), dynamicngrlaunched boolean, paycheckdate varchar(65535), payenddate varchar(65535), calculationtime varchar(65535), otherRepayments numeric(17, 2), calculationtimeepochmillis bigint, employeeid_calculationtimeepochmillis varchar(65535) ) DISTKEY (employeeid) SORTKEY (calculationtimeepochmillis);

Here the problem I am facing is that the ngr value while getting saved to Redshift table changes to 0.59 instead of 0.60. How can this be possible?

Solution

create table table_name (
  employeeid varchar(65535),
  ngr numeric(17, 2),
  totalgross numeric(17, 2),
  totalnetpay numeric(17, 2),
  earningamount numeric(17, 2),
  totalimputedincome numeric(17, 2),
  totalanytimepaywithdrawals numeric(17, 2),
  totalanytimepayrepayments numeric(17, 2),
  dynamicngrlaunched boolean,
  paycheckdate varchar(65535),
  payenddate varchar(65535),
  calculationtime varchar(65535),
  otherRepayments numeric(17, 2),
  calculationtimeepochmillis bigint,
  employeeid_calculationtimeepochmillis varchar(65535)
)
DISTKEY (employeeid)
SORTKEY (calculationtimeepochmillis);

Before getting onto anything else, I would advise you in the strongest possible terms NOT to use maximum length varchar. Last I knew, when rows are brought into memory, they use an amount of memory equal to their maximum length, as specified in the DDL. You have five varchar(65535), so one row of your table is using 320 kilobytes of memory.

Remember the available memory is divided up into queues and slots and then across slices, so you may have really not very much memory available. It could vary hugely, but it could well be something like 100 MB in total. If you're going to do hash joins, you need to ensure the smaller table in the hash join can, when hashed, fit into memory, or performance will go to hell. A running query also needs memory for other things, so if you do have, say, 100 MB, you might have at most half of it available for your hash, and 50 MB with 320 KB rows gives you a maximum of about one hundred and fifty rows in your table. You can of course blow right through this (Redshift won't stop you, and it won't warn you in any way), but performance will go to hell and you'll have no idea why.
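The arithmetic behind those figures can be checked with a short Python sketch. It assumes, as described above, that each varchar is charged at its declared maximum length, it ignores the other columns, and it takes the 50 MB hash budget as an illustrative figure rather than a measured one.

```python
# Back-of-the-envelope check of the row-size figures quoted above.
# Assumption: each varchar(65535) costs its full declared length in memory.
VARCHAR_MAX_BYTES = 65535
MAX_LENGTH_VARCHAR_COLUMNS = 5

row_bytes = VARCHAR_MAX_BYTES * MAX_LENGTH_VARCHAR_COLUMNS
row_kib = row_bytes / 1024

# Assumption: 50 MB of a hypothetical 100 MB slot is free for the hash table.
hash_budget_bytes = 50 * 1024 * 1024
max_rows = hash_budget_bytes // row_bytes

print(f"bytes per row: {row_bytes}")
print(f"KiB per row:   {row_kib:.0f}")
print(f"rows that fit: {max_rows}")
```

Shrinking the varchar columns to realistic lengths changes the result by orders of magnitude, which is the point of the advice.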

Also be careful with your numerics not to go beyond a precision of 19. When precision is 19 or less, numeric is eight bytes, but when 20 or more, numeric becomes sixteen bytes (regardless of the value you actually store) and has to be processed by a math library, rather than directly by the processor hardware.

Also, remember to use NOT NULL where possible, since it reduces the size of a column. This is particularly important for boolean, which is one bit per value when NOT NULL, but two bits per value when nullable, and for varchar, as being nullable adds one byte to the size of data stored for a string.

Finally, you're not setting any encodings. Redshift will choose them for you, but it does a terrible job of picking encodings. I would strongly advise you to pick your own encodings.

Now, on to your problem.

Here the problem I am facing is that the ngr value while getting saved to Redshift table changes to 0.59 instead of 0.60. How can this be possible?

I may be wrong, I'd need to test to check, but I might guess the number is being read first as a float, and then converted to a numeric.

Integers (which is what numeric is, under the hood) and floating point numbers behave differently.

Integers are exact. Floating point numbers are not. By this I mean to say that when you store an integer, you will get back, always, exactly the number you stored. This is not the case with floating point numbers. If you imagine the continuum of numbers between the smallest and largest floating point number as a picket fence, so you have the fence which consists every now and then of a post which goes into the earth, only the numbers at the posts can be stored; so when you store a number, it is converted to the nearest storable number, and that is what is stored, and that's what you get back.

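So when you store 0.60, there is no "post" at exactly 0.60 in binary floating point. The nearest double-precision post is very slightly below it (roughly 0.59999999999999998), and if that value is then truncated, rather than rounded, to two decimal places on conversion to numeric(17, 2), you end up with 0.59, which is what you get back when you read the number.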
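To make the picket fence concrete, here is a small Python sketch using the standard decimal module. It shows the exact value of the double nearest to 0.60 and what happens when that value is cut to two decimal places. Whether Redshift's JSON COPY really goes through a float and truncates is my guess, not something this code demonstrates; the code only shows that such a path would produce 0.59.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

# Decimal(float) exposes the exact value of the binary double nearest to 0.60.
exact = Decimal(0.60)

# Assumption: the float is truncated to two decimal places on conversion.
truncated = exact.quantize(Decimal("0.01"), rounding=ROUND_DOWN)

# For contrast: rounding instead of truncating recovers the intended value.
rounded = exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Parsing the text directly as a decimal never passes through a float at all.
from_text = Decimal("0.60")

print(exact)
print(truncated)
print(rounded)
print(from_text)
```

The first line printed is a long decimal just under 0.6, which is why truncation lands on 0.59 while rounding and direct decimal parsing both give 0.60.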

If you want the number to be exact, you could multiply your numbers by a power of 10, so that the fractional part is always zero, and then store them as integers. So in your case with 0.60, if I assume all your numbers have two decimal places of fractional part, multiply your numbers by 100, so 0.60 becomes 60, and then store 60 as an integer. Do all your math using integers, and convert back to a fractional value only at the very last stage.
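A minimal sketch of that scaled-integer approach in Python follows. The helper names to_hundredths and format_hundredths are hypothetical, and the sketch assumes every amount arrives as text with at most two fractional digits, so the scaling is done on the decimal text and never on a float.

```python
from decimal import Decimal

def to_hundredths(text: str) -> int:
    """Parse decimal text such as "0.60" into an integer count of hundredths."""
    scaled = Decimal(text) * 100
    if scaled != scaled.to_integral_value():
        raise ValueError(f"more than two fractional digits: {text!r}")
    return int(scaled)

def format_hundredths(value: int) -> str:
    """Turn an integer count of hundredths back into decimal text."""
    sign = "-" if value < 0 else ""
    whole, frac = divmod(abs(value), 100)
    return f"{sign}{whole}.{frac:02d}"

# Values taken from the sample record in the question.
ngr = to_hundredths("0.60")
total_gross = to_hundredths("6113.47")
total_net_pay = to_hundredths("3661.6")

# All arithmetic stays in exact integers until the final formatting step.
difference = total_gross - total_net_pay

print(ngr)
print(format_hundredths(ngr))
print(format_hundredths(difference))
```

In the table itself, the equivalent would be storing these scaled values in integer columns and dividing by 100 only when presenting results.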

There is a famous white paper by David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic", which explains the issue:

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html