I have a PySpark UDF which, when I apply it to each row of one of the df columns to get a new column, returns [Ljava.lang.Object;@7e44638d (with a different value after the @ for each row).
Please see the udf below:
def getLocCoordinates(property_address):
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    querystring = {"address": property_address, "key": "THE_API_KEY"}
    response = requests.get(url, params=querystring)
    response_json = json.loads(response.text)
    for adr in response_json['results']:
        geometry = adr['geometry']
        coor = geometry['location']
        lat = coor['lat']
        lng = coor['lng']
        coors = lat, lng
    return coors
getCoorsUDF = udf(lambda x: getLocCoordinates(x))
df = df.withColumn("AddressCoordinates", getCoorsUDF(col("FullAddress")))
I tried:
getCoorsUDF = udf(getLocCoordinates, FloatType()) --> returns NULL for each row of the newly created "AddressCoordinates" column.
getCoorsUDF = udf(getLocCoordinates, StringType()) --> returns [Ljava.lang.Object;@
getCoorsUDF = udf(getLocCoordinates) --> returns [Ljava.lang.Object;@
The result looks like so:
| Ref Num | FullAddress | AddressCoordinates |
| --- | --- | --- |
| 1234 | Some Address | [Ljava.lang.Object;@ |
This gets returned for each row in the dataframe.
Initially I was using the function in a plain Python notebook and it was working fine; lat and lng were returned for each address. However, I had to move this to PySpark and I am hitting a brick wall here.
I think you're seeing the [Ljava.lang.Object;@... output because your UDF is returning a Python tuple ((lat, lng)), and PySpark doesn't know how to serialize that into a DataFrame column unless you explicitly define a return schema that Spark understands. (That is also why declaring FloatType() gives you NULL: the tuple can't be converted to a single float, so Spark nulls the value out.) You should return a StructType with fields for lat and lng. For example, you can do something like this:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, DoubleType
import requests
import json

# defining the return type for the UDF
location_schema = StructType([
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True)
])
def getLocCoordinates(property_address):
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    params = {
        "address": property_address,
        "key": "YOUR_API_KEY"
    }
    try:
        response = requests.get(url, params=params)
        data = response.json()
        if data['results']:
            location = data['results'][0]['geometry']['location']
            return {"lat": location['lat'], "lng": location['lng']}
    except Exception as e:
        print(f"Error: {e}")
    return None
# registering the UDF with schema
getCoorsUDF = udf(getLocCoordinates, location_schema)
# now you apply the UDF
df = df.withColumn("AddressCoordinates", getCoorsUDF(col("FullAddress")))
# an option would be to extract lat and lng as separate columns
df = df.withColumn("Latitude", col("AddressCoordinates.lat")) \
.withColumn("Longitude", col("AddressCoordinates.lng"))