I'm plotting 550,000,000 latitudes and longitudes using datashader. But for this to be useful, I need to overlay map tiles and polygons using geoviews. The problem is that geoviews.Points() and the associated projection cause a large slowdown that negates the interactive nature of holoviews + bokeh plots.
There's a reproducible example below, but, in short: I'm trying to make the geoviews implementation (3) fast enough to work interactively.
import numpy as np
import pandas as pd
import dask.dataframe as dd
import datashader as ds
import datashader.transfer_functions as tf
import holoviews as hv
from holoviews.operation.datashader import datashade
import geopandas as gpd
import geoviews as gv
Scaling the size of the data down by a factor of 10 for this example:
uk_bounding_box = (-14.02, 2.09, 49.67, 61.06)
n = int(550000000 / 10)

# Generate some fake data of the same size
df = dd.from_pandas(
    pd.DataFrame.from_dict({
        'longitude': np.random.normal(
            np.mean(uk_bounding_box[0:2]),
            np.diff(uk_bounding_box[0:2]) / 5, n
        ),
        'latitude': np.random.normal(
            np.mean(uk_bounding_box[2:4]),
            np.diff(uk_bounding_box[2:4]) / 5, n
        )
    }), npartitions=8
)
# Persist data in memory so reading won't slow down datashader
df = df.persist()
Just using datashader, without holoviews or geoviews, is very quick: output is rendered in 4 seconds, including aggregation, so re-renders would be even faster if interactive.
# Set some plotting params
bounds = dict(x_range=uk_bounding_box[0:2],
              y_range=uk_bounding_box[2:4])
plot_width = 400
plot_height = 300
The timing for the pure datashader version:
%%time
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds)
agg = cvs.points(df, 'longitude', 'latitude', ds.count())
CPU times: user 968 ms, sys: 29.9 ms, total: 998 ms
Wall time: 506 ms
tf.shade(agg)
2. datashader in holoviews, without geoviews projection

# Set some params
sizes = dict(width=plot_width, height=plot_height)
opts = dict(bgcolor="black", **sizes)
hv.extension('bokeh')
hv.util.opts('Image Curve RGB Polygons [width=400 height=300 shared_axes=False] {+axiswise} ')
Without any projection, this is comparable to using pure datashader:
%%time
points = hv.Points(df, ['longitude', 'latitude']).redim.range(
    x=bounds['x_range'], y=bounds['y_range'])
shader = datashade(points, precompute=True, **sizes).options(**opts)
CPU times: user 3.32 ms, sys: 131 µs, total: 3.45 ms
Wall time: 3.47 ms
shader
3. datashader in holoviews, with geoviews tiles, polygons, and projection

Here's the crux of the issue: I want to align the datashader layer with some map tiles and geospatial polygons. For the size of data I'm dealing with, this results in a slowdown that makes the interactive visualisation useless (a 12-minute wait in total for the render).
I'm sure this is due to the overhead associated with projecting the points. Is there a way to avoid this, or some other workaround such as precomputing the projection?
# Grab an example shape file to work with
ne_path = gpd.datasets.get_path('naturalearth_lowres')
example_shapes_df = gpd.read_file(ne_path)
uk_shape = example_shapes_df[example_shapes_df.name.str.contains('United K')]
# Grab maptiles
map_tiles = gv.tile_sources.ESRI
# In actual workflow I need to add some polygons
polys = gv.Polygons(uk_shape)
This is as above, with the addition of gv.Points() and projection:
%%time
points = gv.Points(df, ['longitude', 'latitude']).redim.range(
    x=bounds['x_range'], y=bounds['y_range'])
projected = gv.operation.project_points(points)
shader = datashade(projected, precompute=True, **sizes).options(**opts)
CPU times: user 11.8 s, sys: 3.16 s, total: 15 s
Wall time: 12.5 s
shader * map_tiles * polys
As suggested by @philippjfr, the solution is to project the coordinates into the appropriate coordinate system up front, and then render using method 2 or 3 above.
Something like this:
import cartopy

def platecarree_to_mercator_vectorised(x, y):
    '''Use cartopy to convert PlateCarree coords to Web Mercator.'''
    return cartopy.crs.GOOGLE_MERCATOR.transform_points(
        cartopy.crs.PlateCarree(), x, y)

def platecarree_for_map_partitions(pddf):
    '''Wrapper to apply the Mercator conversion and convert back to a dataframe for dask.'''
    as_arrays = platecarree_to_mercator_vectorised(pddf.longitude.values, pddf.latitude.values)
    # Keep only x and y (transform_points also returns z); the columns keep
    # their original names but now hold Web Mercator metres
    as_df = pd.DataFrame.from_records(as_arrays[:, :2], columns=['longitude', 'latitude'])
    return as_df

# Project the points
df_projected = df.map_partitions(platecarree_for_map_partitions,
                                 meta={'longitude': 'f8', 'latitude': 'f8'})
from dask.diagnostics import ProgressBar

with ProgressBar():
    df_projected.to_parquet('abb_projected.parquet', compression='SNAPPY')
Then use this projected dataset with method 2 or 3, as detailed in the question.
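For example, here's a minimal sketch of method 3 using the precomputed coordinates (the parquet filename comes from the step above; since the points are already in Web Mercator, declaring crs=cartopy.crs.GOOGLE_MERCATOR on gv.Points should mean no further reprojection is needed at render time):
# Read the pre-projected points back and persist them in memory
df_projected = dd.read_parquet('abb_projected.parquet').persist()

# Declare the points as already being in Web Mercator so geoviews
# skips the expensive per-render projection step
points = gv.Points(df_projected, ['longitude', 'latitude'],
                   crs=cartopy.crs.GOOGLE_MERCATOR)
shader = datashade(points, precompute=True, **sizes).options(**opts)
shader * map_tiles * polys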