I have three separate one-dimensional NumPy arrays of equal length that I pass as the x, y and c parameters to the matplotlib scatter function without a problem. Some of the plot coordinates contained in the x and y arrays are duplicated. Where the coordinates are duplicated, I would like to plot the sum of all the related c parameter (data) values.
Is there a built-in matplotlib way of doing this? Alternatively, I think that I need to remove all the duplicated coordinates from the x and y array and the associated values from the data array. But before doing this, the associated data values must be added to the data array related to the remaining coordinates.
A trivial example is shown below where the duplicated coordinates have been removed and data values added to the one remaining coordinate pair.
Before
x = np.array([3, 7, 12, 3, 56, 4, 2, 3, 65, 87, 12, 3, 9, 7, 87])
y = np.array([7, 24, 87, 9, 65, 43, 54, 9, 3, 8, 34, 9, 23, 6, 8])
data = np.array([6, 45, 4, 25, 7, 45, 78, 4, 82, 3, 9, 43, 32, 5, 9])
After
x = np.array([3, 7, 12, 3, 56, 4, 2, 65, 87, 12, 9, 7])
y = np.array([7, 24, 87, 9, 65, 43, 54, 3, 8, 34, 23, 6])
data = np.array([6, 45, 4, 72, 7, 45, 78, 4, 12, 9, 32, 5])
I have found an algorithm on Stack Overflow that removes the duplicate coordinates from the x and y arrays in seconds using Python's zip and a set. However, my attempt to extend this to the data array took an hour to execute, and I don't have the experience to improve on it. The arrays are typically 600,000 elements long.
The following approach is pretty fast, even for datasets much larger than the ones you are dealing with. I tested it with 6_000_000 elements each for x, y and data, and it still finished within about 10 s on a machine that is not particularly powerful.
What is time-consuming, though, is printing the arrays once they reach a certain size.
import numpy as np
# generate some test data
x = np.random.randint(0, 100_000, 600_000)
y = np.random.randint(0, 100_000, 600_000)
data = np.random.randint(0, 10_000, 600_000)
# initialize the result dict;
# set(zip()) makes sure we are dealing only with unique x/y pairs
data_tmp = {key: 0 for key in set(zip(x, y))}
# accumulate the sum for each unique x/y pair
for key, val in zip(zip(x, y), data):
    data_tmp[key] += val
# translate the dict back into the cleaned-up arrays
x_after = np.array([a for a, _ in data_tmp.keys()])
y_after = np.array([b for _, b in data_tmp.keys()])
data_after = np.array(list(data_tmp.values()))
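If the Python-level loop ever becomes the bottleneck, the same grouping can probably be done entirely inside NumPy. The following is only a sketch of that alternative, not the approach used above: it relies on `np.unique` with `axis=0` and `return_inverse=True` to label each row with its unique pair, and on `np.add.at` to accumulate the values. The helper name `sum_duplicates` and the tiny input arrays are made up for illustration. Note that `np.unique` returns the pairs sorted, so the order of first occurrence is not preserved.

```python
import numpy as np

def sum_duplicates(x, y, data):
    """Sum data over duplicated (x, y) pairs (hypothetical helper name)."""
    # one row per point: shape (n, 2)
    pairs = np.column_stack((x, y))
    # unique rows (sorted) plus, for every input row, the index of its unique row
    unique_pairs, inverse = np.unique(pairs, axis=0, return_inverse=True)
    # flatten defensively, since the shape of `inverse` has varied between NumPy versions
    inverse = inverse.reshape(-1)
    sums = np.zeros(len(unique_pairs), dtype=data.dtype)
    # unbuffered in-place add, so repeated indices accumulate correctly
    np.add.at(sums, inverse, data)
    return unique_pairs[:, 0], unique_pairs[:, 1], sums

# small made-up input: the pair (1, 5) occurs twice
x = np.array([1, 1, 2, 1])
y = np.array([5, 5, 6, 7])
data = np.array([10, 20, 30, 40])

x_u, y_u, data_u = sum_duplicates(x, y, data)
print(x_u.tolist(), y_u.tolist(), data_u.tolist())
```

The three returned arrays can be passed to scatter as x, y and c in the same way as the original ones.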
As a side note: while checking the code against your example, I realized that your "After" data seems to be wrong. The second 4 needs to be 82.
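To check this, the same dict-based summation can be run on the small example from the question. This is a minimal sketch that assumes Python 3.7+, where a dict keeps its insertion order, so the result comes out in order of first occurrence like the "After" arrays; the name `totals` is only illustrative.

```python
import numpy as np

# "Before" arrays copied from the question
x = np.array([3, 7, 12, 3, 56, 4, 2, 3, 65, 87, 12, 3, 9, 7, 87])
y = np.array([7, 24, 87, 9, 65, 43, 54, 9, 3, 8, 34, 9, 23, 6, 8])
data = np.array([6, 45, 4, 25, 7, 45, 78, 4, 82, 3, 9, 43, 32, 5, 9])

# accumulate the data values per (x, y) pair, keeping first-occurrence order
totals = {}
for key, val in zip(zip(x.tolist(), y.tolist()), data.tolist()):
    totals[key] = totals.get(key, 0) + val

x_after = np.array([a for a, _ in totals])
y_after = np.array([b for _, b in totals])
data_after = np.array(list(totals.values()))

print(data_after.tolist())
# [6, 45, 4, 72, 7, 45, 78, 82, 12, 9, 32, 5]
```

The coordinate pair (65, 3) occurs only once, with the value 82, so the eighth entry is 82 rather than 4.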