python, apache-spark, user-defined-functions

udf returning Ljava.lang.Object;@


I have a PySpark UDF that I apply to one of the DataFrame columns to produce a new column, but instead of the expected value every row comes back as [Ljava.lang.Object;@7e44638d (with a different value after the @ for each row).

Please see the UDF below:

from pyspark.sql.functions import udf, col
import requests
import json

def getLocCoordinates(property_address):
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    querystring = {"address": property_address, "key": "THE_API_KEY"}
    response = requests.get(url, params=querystring)
    response_json = json.loads(response.text)

    for adr in response_json['results']:
        geometry = adr['geometry']
        coor = geometry['location']
        lat = coor['lat']
        lng = coor['lng']
        coors = lat, lng
        return coors

getCoorsUDF = udf(lambda x:getLocCoordinates(x))

df = df.withColumn("AddressCoordinates", getCoorsUDF(col("FullAddress")))

The result looks like this:

Ref Num    FullAddress      AddressCoordinates
1234       Some Address     [Ljava.lang.Object;@

This gets returned for each row in the dataframe.

Initially I was using the function in a plain Python notebook and it worked fine; lat and lng were returned for each address. However, I had to move this to PySpark, and I am hitting a brick wall here.


Solution

  • I think you're seeing the [Ljava.lang.Object;@... output because your UDF returns a Python tuple ((lat, lng)) while the UDF was registered without a return type, so Spark falls back to the default StringType and you end up with the string form of the underlying Java object array instead of a usable column value. PySpark can only serialize the tuple properly if you explicitly declare a return schema it understands.

    You should return a StructType with fields for lat and lng. For example, you can do something like this:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StructType, StructField, DoubleType
    import requests
    import json
    
    # defining return type for the UDF
    location_schema = StructType([
        StructField("lat", DoubleType(), True),
        StructField("lng", DoubleType(), True)
    ])
    
    def getLocCoordinates(property_address):
        url = "https://maps.googleapis.com/maps/api/geocode/json"
        params = {
            "address": property_address,
            "key": "YOUR_API_KEY"
        }
        try:
            response = requests.get(url, params=params)
            data = response.json()
            if data['results']:
                location = data['results'][0]['geometry']['location']
                return {"lat": location['lat'], "lng": location['lng']}
        except Exception as e:
            print(f"Error: {e}")
        return None
    
    # registering the UDF with schema
    getCoorsUDF = udf(getLocCoordinates, location_schema)
    
    # now you apply the UDF
    df = df.withColumn("AddressCoordinates", getCoorsUDF(col("FullAddress")))
    
    # an option would be to extract lat and lng as separate columns
    df = df.withColumn("Latitude", col("AddressCoordinates.lat")) \
           .withColumn("Longitude", col("AddressCoordinates.lng"))
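
    Once the struct return type is in place, a quick check on a few rows should show actual lat/lng values instead of the Java object reference. A minimal sketch, using the column names from the snippet above:

    # quick sanity check on the new columns
    df.select("FullAddress", "AddressCoordinates", "Latitude", "Longitude") \
      .show(5, truncate=False)

    If you don't need named fields, declaring the UDF with ArrayType(DoubleType()) and returning [lat, lng] is another schema Spark can serialize, but the struct keeps the lat and lng labels, which is easier to work with downstream.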