Tags: python, matplotlib, scatter-plot

For identical XY coordinates in the Matplotlib scatter function, how can I sum all the related array-like data values (c-parameter) on the plot?


I have three separate one-dimensional NumPy arrays of equal length that I pass as the x, y and c parameters to Matplotlib's scatter function without a problem. Some of the plot coordinates contained in the x and y arrays are duplicated. Where the coordinates are duplicated, I would like to plot the sum of all the related c-parameter (data) values.

Is there a built-in Matplotlib way of doing this? Alternatively, I think I need to remove all the duplicated coordinates from the x and y arrays, along with the associated values from the data array. But before doing this, those data values must be added to the data values of the remaining coordinates.

A trivial example is shown below, where the duplicated coordinates have been removed and their data values added to the one remaining coordinate pair.

Before
x =    np.array([3, 7, 12, 3, 56, 4, 2, 3, 65, 87, 12, 3, 9, 7, 87])
y =    np.array([7, 24, 87, 9, 65, 43, 54, 9, 3, 8, 34, 9, 23, 6, 8])
data = np.array([6, 45, 4, 25, 7, 45, 78, 4, 82, 3, 9, 43, 32, 5, 9])

After
x =    np.array([3, 7, 12, 3, 56, 4, 2, 65, 87, 12, 9, 7])
y =    np.array([7, 24, 87, 9, 65, 43, 54, 3, 8, 34, 23, 6])
data = np.array([6, 45, 4, 72, 7, 45, 78, 4, 12, 9, 32, 5])

I have found an algorithm on Stack Overflow that removes the duplicate coordinates from the x and y arrays in seconds, using Python's zip and a set. However, my attempt to extend this to the data array took an hour to execute, and I don't have the experience to improve on it. The arrays are typically 600,000 elements long.


Solution

  • The following approach is pretty fast, even for datasets much larger than the ones you are dealing with. I tested a size of 6_000_000 for x, y and data, and it still finished within about 10 s on a machine that is not particularly powerful.

    What is time-consuming, though, is printing the arrays once they reach a certain size.

    import numpy as np
    
    # generate some test data
    x = np.random.randint(0, 100_000, 600_000)
    y = np.random.randint(0, 100_000, 600_000)
    data = np.random.randint(0, 10_000, 600_000)
    
    # initialize the result dict;
    # set(zip(...)) makes sure we only deal with unique x/y pairs
    data_tmp = {key: 0 for key in set(zip(x, y))}
    
    # determine the sum for each unique (x, y) pair
    for key, val in zip(zip(x, y), data):
        data_tmp[key] += val
    
    # translate the dict back into your cleaned-up arrays
    # (note: .values() is a dict view, so convert it before
    # passing it to scatter, which expects array-like input)
    x_after = np.array([a for a, _ in data_tmp.keys()])
    y_after = np.array([b for _, b in data_tmp.keys()])
    data_after = np.array(list(data_tmp.values()))
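As a minor variant (my own sketch, not part of the answer above): the dict-building and summing passes can be collapsed into a single pass with `collections.defaultdict`, which avoids materializing the set of unique pairs first. The `sum_duplicates` helper name is mine:

```python
from collections import defaultdict

import numpy as np

def sum_duplicates(x, y, data):
    # accumulate the data values per unique (x, y) pair in one pass
    acc = defaultdict(int)
    for key, val in zip(zip(x, y), data):
        acc[key] += val
    # unpack the dict back into three arrays
    xs, ys = zip(*acc.keys())
    return np.array(xs), np.array(ys), np.array(list(acc.values()))

# the example arrays from the question
x = np.array([3, 7, 12, 3, 56, 4, 2, 3, 65, 87, 12, 3, 9, 7, 87])
y = np.array([7, 24, 87, 9, 65, 43, 54, 9, 3, 8, 34, 9, 23, 6, 8])
data = np.array([6, 45, 4, 25, 7, 45, 78, 4, 82, 3, 9, 43, 32, 5, 9])

x_after, y_after, data_after = sum_duplicates(x, y, data)
```

The result can then be fed to `plt.scatter(x_after, y_after, c=data_after)` directly.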
    

    As a sidenote:

    Checking the code against your example, I noticed that your "After" data array seems to be wrong: the second 4 needs to be 82.
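For comparison, here is a fully vectorized sketch of the same summation (my own addition, not from the answer above), built on `np.unique` and `np.bincount`. Note that `np.unique` sorts its output, so the resulting order differs from the input order, which should not matter for a scatter plot:

```python
import numpy as np

# the example arrays from the question
x = np.array([3, 7, 12, 3, 56, 4, 2, 3, 65, 87, 12, 3, 9, 7, 87])
y = np.array([7, 24, 87, 9, 65, 43, 54, 9, 3, 8, 34, 9, 23, 6, 8])
data = np.array([6, 45, 4, 25, 7, 45, 78, 4, 82, 3, 9, 43, 32, 5, 9])

# unique (x, y) rows, plus an index mapping every original row
# to its unique counterpart
xy = np.stack([x, y], axis=1)
unique_xy, inverse = np.unique(xy, axis=0, return_inverse=True)
inverse = inverse.ravel()  # guard: some NumPy versions return a 2-D inverse here

# sum the data values that fall on each unique pair
data_after = np.bincount(inverse, weights=data)

x_after, y_after = unique_xy[:, 0], unique_xy[:, 1]
```

On 600,000-element arrays this avoids the Python-level loop entirely, at the cost of the sort inside `np.unique`.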