Tags: scala, apache-spark, dataframes, sha2

How to convert a hexadecimal column in Scala to int


I tried to use the conv function as I saw in some examples, but it is not working for me. I don't understand why this function returns the same value for every row in my DataFrame's column. I was using Spark 2.1 with Scala 2.11.11, and then I tried Spark 2.2 with Scala 2.11.11 as well, but when I apply the conv function to my SHA2 hash, it does not work as expected. My code is:

val newDf = Df.withColumn("id", conv(sha2(col("id"), 256), 16, 10).cast(IntegerType))

Any advice? Thank you very much!


Solution

  • Unfortunately, there isn't a good solution for this using the conv function in Spark, because the 256-bit hash produced by SHA2 is too long to be parsed as an integer in Java/Scala. Furthermore, IntegerType, like the underlying Scala Int, is 32 bits, so even if conv were doing something clever in the conversion that allowed it to handle larger numbers, the resulting cast would still fail.

    If you remove the cast to IntegerType, you will see that the result returned by conv is 18446744073709551615 regardless of the input value, as demonstrated in the first sketch below. This is 2^64 - 1, the maximum unsigned 64-bit integer value. It can't be successfully cast to IntegerType or LongType, so the cast ends up returning null.

    If you want to really dig in, the implementation of Spark's NumberConverter class, which backs the conv SQL function, shows that the conversion goes through a 64-bit unsigned integer: https://github.com/apache/spark/blob/f07c5064a3967cdddf57c2469635ee50a26d864c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/NumberConverter.scala#L143.

    The best you can probably do is write a UDF that does some clever math to break the value up into lower-order and higher-order components that can be converted separately and then reconstituted, if you really need to view the hash as a number. A sketch of that approach follows the demonstration below.
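
    For illustration, here is a minimal way to observe the saturating behavior described above. This is a sketch, not code from the question: the single-column DataFrame and the spark-shell session (where spark and its implicits are in scope) are assumptions made for the example.

        import org.apache.spark.sql.functions.{col, conv, sha2}
        import spark.implicits._

        // Hypothetical one-column DataFrame standing in for the question's Df.
        val Df = Seq("1", "2", "42").toDF("id")

        // With the cast removed, every row shows 18446744073709551615 (2^64 - 1),
        // because the 256-bit hash overflows the 64-bit unsigned value that conv
        // uses internally.
        Df.withColumn("id", conv(sha2(col("id"), 256), 16, 10)).show(false)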
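
    And here is a minimal sketch of the UDF approach, assuming the Df and id column from the question. It splits the 64-character hex digest into 16-character chunks (each small enough to fit in 64 bits), converts each chunk, and folds them back together with arbitrary-precision arithmetic. The result has to stay a string, since no built-in Spark numeric type can hold a 256-bit value.

        import org.apache.spark.sql.functions.{col, sha2, udf}

        // Converts a hex string of arbitrary length to its decimal representation.
        // Each 16-hex-digit chunk fits in an unsigned 64-bit value; BigInt
        // arithmetic reconstitutes the full number from the chunks.
        val hexToDecimal = udf { (hex: String) =>
          hex.grouped(16).foldLeft(BigInt(0)) { (acc, chunk) =>
            // Shift the accumulator left by 4 bits per hex digit, then add the
            // chunk's value.
            (acc << (4 * chunk.length)) + BigInt(chunk, 16)
          }.toString
        }

        val newDf = Df.withColumn("id", hexToDecimal(sha2(col("id"), 256)))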