I am using Pandas and PyProj to convert eastings and northings to longitude and latitude, and then save the split output into two columns, like this:
    import pandas as pd
    from pyproj import Proj, transform

    v84 = Proj(proj="latlong", towgs84="0,0,0", ellps="WGS84")
    v36 = Proj(proj="latlong", k=0.9996012717, ellps="airy",
               towgs84="446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894")
    vgrid = Proj(init="world:bng")

    def convertLL(row):
        easting = row['easting']
        northing = row['northing']
        # British National Grid -> lon/lat on the OSGB36 datum
        vlon36, vlat36 = vgrid(easting, northing, inverse=True)
        # OSGB36 -> WGS84
        converted = transform(v36, v84, vlon36, vlat36)
        row['longitude'] = converted[0]
        row['latitude'] = converted[1]
        return row

    values = pd.read_csv("values.csv")
    values = values.apply(convertLL, axis=1)
This works, but it is very slow and times out on larger datasets. In an effort to improve things, I am trying to convert it to use a lambda function instead, in the hope that this will speed things up. I have this so far:
    def convertLL(easting, northing):
        vlon36, vlat36 = vgrid(easting, northing, inverse=True)
        converted = transform(v36, v84, vlon36, vlat36)
        # only the longitude is returned; the latitude (converted[1]) is dropped
        return converted[0]

    values['longitude'] = values.apply(lambda row: convertLL(row['easting'], row['northing']), axis=1)
This converted version works, is faster than my old one, and does not time out on larger datasets, but it only produces the longitude. Is there a way to get it to produce the latitude as well?
Also, is this vectorized? Can I speed things up any more?
EDIT
A sample of data...
name | northing | easting | latitude | longitude
------------------------------------------------
tl1 | 378778 | 366746 | |
tl2 | 384732 | 364758 | |
Because of the subject matter, I think we couldn't see the wood for the trees. If we look at the docs for transform, you'll see:
- xx (scalar or array (numpy or python)) – Input x coordinate(s).
- yy (scalar or array (numpy or python)) – Input y coordinate(s).
Great; the numpy array is exactly what we need. A pd.DataFrame can be thought of as a dictionary of arrays, so we just need to isolate those columns and pass them to the function. There's a tiny catch: columns of a DataFrame will be a Series, which transform will reject, so we just need to use the values attribute. This mini example is directly equivalent to your initial approach:
    def vectorized_convert(df):
        vlon36, vlat36 = vgrid(df['easting'].values,
                               df['northing'].values,
                               inverse=True)
        converted = transform(v36, v84, vlon36, vlat36)
        df['longitude'] = converted[0]
        df['latitude'] = converted[1]
        return df

    df = pd.DataFrame({'northing': [378778, 384732],
                       'easting': [366746, 364758]})
    print(vectorized_convert(df))
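To make the Series-versus-array distinction concrete, here is a minimal sketch using only pandas and NumPy on the sample data from the question. The `shifted` offset is purely hypothetical and stands in for any whole-array operation; it is not part of the coordinate conversion.

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'northing': [378778, 384732],
                       'easting': [366746, 364758]})

    col = df['easting']          # a pandas Series (carries an index)
    arr = df['easting'].values   # the underlying NumPy array

    print(type(col).__name__)
    print(type(arr).__name__)

    # An operation on the array is applied to every element in one call,
    # with no Python-level loop over rows (hypothetical offset, illustration only).
    shifted = arr + 1000
    print(shifted)
    ```

Passing arrays like `arr` to a function that accepts them is what lets the work happen once per column rather than once per row.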
And we're done. With that aside, we can look to timings for 100 rows (the current approach explodes for my usual 100,000 rows for timing examples):
    def current_way(df):
        df = df.apply(convertLL, axis=1)
        return df

    def vectorized_convert(df):
        vlon36, vlat36 = vgrid(df['easting'].values,
                               df['northing'].values,
                               inverse=True)
        converted = transform(v36, v84, vlon36, vlat36)
        df['longitude'] = converted[0]
        df['latitude'] = converted[1]
        return df

    df = pd.DataFrame({'northing': [378778, 384732] * 50,
                       'easting': [366746, 364758] * 50})
Gives:
%timeit current_way(df)
289 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vectorized_convert(df)
2.95 ms ± 59.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
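If a row-wise function is still wanted for some reason, one possible way to fill both columns from the lambda version in the question is to return a tuple and let `apply` expand it. This is only a sketch: `fake_convert` is a hypothetical stand-in that returns a made-up pair so the example runs without pyproj, and it is not a real coordinate conversion. It will also remain much slower than the vectorized version, since it still calls Python once per row.

    ```python
    import pandas as pd

    def fake_convert(easting, northing):
        """Hypothetical stand-in for the pyproj call: returns a (lon, lat)-shaped tuple."""
        return easting / 1e5, northing / 1e5

    values = pd.DataFrame({'northing': [378778, 384732],
                           'easting': [366746, 364758]})

    # result_type='expand' turns each returned tuple into separate columns,
    # which are then assigned to the two new column names in order.
    values[['longitude', 'latitude']] = values.apply(
        lambda row: fake_convert(row['easting'], row['northing']),
        axis=1, result_type='expand')

    print(values)
    ```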