This question was edited after answers for show final solution I used
I have unstructured 2D datasets coming from different sources, like by example:
Theses datasets are 3 numpy.ndarray (X, Y coordinates and Z value).
My final aim is to interpolate theses datas on a grid for conversion to image/matrix. So, I need to find the "best grid" for interpolate theses datas. And, for this I need to find the best X and Y step between pixels of that grid.
Determinate step based on Euclidean distance between points:
Use the mean of Euclidean distances between each point and its nearest neighbour.
KDTree
/cKDTree
from scipy.spacial for build tree of the X,Y datas.query
method with k=2
for get the distances (If k=1
, distances are only zero because query for each point found itself).
# Generate KD Tree
xy = np.c_[x, y] # X,Y data converted for use with KDTree
tree = scipy.spacial.cKDTree(xy) # Create KDtree for X,Y coordinates.
# Calculate step
distances, points = tree.query(xy, k=2) # Query distances for X,Y points
distances = distances[:, 1:] # Remove k=1 zero distances
step = numpy.mean(distances) # Result
Performance tweaking:
scipy.spatial.cKDTree
and not scipy.spatial.KDTree
because it is really faster.balanced_tree=False
with scipy.spatial.cKDTree
: Big speed up in my case, but may not be true for all data.n_jobs=-1
with cKDTree.query
for use multithreading.p=1
with cKDTree.query
for use Manhattan distance in place of Euclidian distance (p=2
): Faster but may be less accurate.Interpolate points on grid:
Interpolate dataset points on grid using the calculated step.
# Generate grid
def interval(axe):
'''Return numpy.linspace Interval for specified axe'''
cent = axe.min() + axe.ptp() / 2 # Interval center
nbs = np.ceil(axe.ptp() / step) # Number of step in interval
hwid = nbs * step / 2 # Half interval width
return np.linspace(cent - hwid, cent + hwid, nbs) # linspace
xg, yg = np.meshgrid(interval(x), interval(y)) # Generate grid
# Interpolate X,Y,Z datas on grid
zg = scipy.interpolate.griddata((x, y), z, (xg, yg))
Set NaN if pixel too far from initials points:
Set NaN to pixels from grid that are too far (Distance > step) from points from initial X,Y,Z data. The previous generated KDTree is used.
# Calculate pixel to X,Y,Z data distances
dist, _ = tree.query(np.c_[xg.ravel(), yg.ravel()])
dist = dist.reshape(xg.shape)
# Set NaN value for too far pixels
zg[dist > step] = np.nan
I suggest you to go with KDTree.query
.
You are searching of a carachteristic distance to scale your binning: I suggest you to take only a random subset of your points, and to use the Manhattan distance, becasue KDTree.query
is very slow (and yet it is a n*log(n) complexity).
Here is my code:
# CreateTree
tree=scipy.spatial.KDTree(numpy.array(points)) # better give it a copy?
# Create random subsample of points
n_repr=1000
shuffled_points=numpy.array(points)
numpy.random.shuffle(shuffled_points)
shuffled_points=shuffled_points[:n_repr]
# Query the tree
(dists,points)=tree.query(shuffled_points,k=2,p=1)
# Get _extimate_ of average distance:
avg_dists=numpy.average(dists)
print('average distance Manhattan with nearest neighbour is:',avg_dists)
I suggest you to use the Manhattan distance ( https://en.wikipedia.org/wiki/Taxicab_geometry ) because it is was faster to compute than the euclidean distance. And since you need only an estimator of the average distance it should be sufficient.