python pyspark geohashing

How to Decode GEOHASH Column using PySpark


I'm trying to decode the GEOHASH column to latitude and longitude using the pygeohash library. Below is my code:

import pygeohash as pgh
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

udf1 = udf(lambda x: pgh.decode(x))
add_latlong = add.withColumn('location', udf1(col('GEOHASH')))

However, I'm getting the result below:

+------------+--------------------+
|     GEOHASH|            location|
+------------+--------------------+
|w284nyv39qzn|[Ljava.lang.Objec...|
|w0zqyr64nt4v|[Ljava.lang.Objec...|
|w2815pb0yfgr|[Ljava.lang.Objec...|
|w281xv1czv1t|[Ljava.lang.Objec...|
|w2r7cvc0m1bz|[Ljava.lang.Objec...|
+------------+--------------------+

I came across the thread "PySpark UDF Returns [Ljava.lang.Object;@]", which suggests passing StringType as the second parameter of udf, but I still see the same result as above. How do I get the latitude and longitude from here?

Appreciate your help

Update: I used the solution from Jonathan Lam below. For completeness, here are the code and the resulting dataframe.

from pyspark.sql.types import ArrayType, FloatType
udf1 = udf(lambda x: pgh.decode(x), ArrayType(FloatType()))
add_latlong = add.withColumn('location', udf1(col('GEOHASH'))) \
    .withColumn('lat', col('location')[0]).withColumn('long', col('location')[1])

+------------+--------------------+--------+----------+
|     GEOHASH|            location|     lat|      long|
+------------+--------------------+--------+----------+
|w2864utg8uyf|[3.189408, 101.73...|3.189408| 101.73035|
|w281hj25hzre|[3.017675, 101.42...|3.017675|101.425995|
|w2830hj8vzrp|[3.010423, 101.60...|3.010423|101.609375|
|w0zf5uepz8uk|[4.596367, 101.06...|4.596367| 101.06768|
|w2rkk6s97gvt|[2.167289, 111.63...|2.167289| 111.63843|
+------------+--------------------+--------+----------+ 

Solution

  • I'm not sure your case is the same as the one in the thread you linked, since you are using an external package, pgh.decode(x), to do the transformation. Based on the docs:

    pgh.decode(geohash='ezs42')
    # >>> ('42.6', '-5.6')
    

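    decode returns a (latitude, longitude) pair rather than a single string, and udf falls back to a StringType return type when none is given, so I think you should declare the return type as ArrayType(FloatType()) instead.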
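
To make the return-type point concrete, here is a minimal, dependency-free sketch of what a geohash decoder computes and how it could be wired into a UDF. Note that decode_geohash and with_lat_long are hypothetical names used for illustration; decode_geohash is a stand-in for pgh.decode, not the library's own implementation, and it returns the centre of the geohash cell without the rounding a library may apply. The sketch also assumes lowercase geohashes and uses DoubleType rather than FloatType, since Python floats are double precision.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def decode_geohash(geohash):
    """Hypothetical stand-in for pgh.decode: returns [lat, lon] as floats."""
    if geohash is None:
        return None  # keep nulls as nulls instead of failing inside the UDF
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for ch in geohash:
        value = _BASE32.index(ch)  # raises ValueError on an invalid character
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    # Centre of the final cell, as a list so it maps onto an ArrayType column
    return [(lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2]


def with_lat_long(df, geohash_col="GEOHASH"):
    """Hypothetical helper: adds location, lat and long columns to df."""
    # Imported here so the decoder above stays usable without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType

    # The explicit return type is what prevents the [Ljava.lang.Object output.
    decode_udf = udf(decode_geohash, ArrayType(DoubleType()))
    located = df.withColumn("location", decode_udf(col(geohash_col)))
    return (located
            .withColumn("lat", col("location")[0])
            .withColumn("long", col("location")[1]))
```

With the dataframe from the question, this would be called as with_lat_long(add). Swapping decode_geohash for lambda x: pgh.decode(x) inside the udf call should behave the same way, as long as the declared return type stays an array of floating-point values.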