pythonbokehscatter-plotholoviewsdatashader

How to tell datashader to use gray color for NA values when plotting categorical data


I have very large dataset that I cannot plot directly using holoviews. I want to make a scatterplot with categorial data. Unfortunately my data is very sparse and many points have NA as category. I would like to make these points gray. Is there any way to make datashader know what I want to do?

I show you the way I do it now (as more or less proposed in https://holoviews.org/user_guide/Large_Data.html ). I provide you an example:

import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
import datashader as ds
from datashader.colors import Sets1to3
from holoviews.operation.datashader import datashade,spread



raw_data = [('Alice', 60, 'London', 5) ,
           ('Bob', 14, 'Delhi' , 7) ,
           ('Charlie', 66, np.NaN, 11) ,
           ('Dave', np.NaN,'Delhi' , 15) ,
           ('Eveline', 33, 'Delhi' , 4) ,
           ('Fred', 32, 'New York', np.NaN ),
           ('George', 95, 'Paris', 11)
            ]
# Create a DataFrame object
df = pd.DataFrame(raw_data, columns=['Name', 'Age', 'City', 'Experience'])
df['City']=pd.Categorical(df['City'])



x='Age'
y='Experience'
color='City'
cats=df[color].cat.categories





# Make dummy-points (currently the only way to make a legend: https://holoviews.org/user_guide/Large_Data.html)
for cat in cats:
    #Just to make clear how many points of a given category we have
    print(cat,((df[color]==cat)&(df[x].notnull())&(df[y].notnull())).sum())
color_key=[(name,color) for name, color in zip(cats,Sets1to3)]
color_points = hv.NdOverlay({n: hv.Points([0,0], label=str(n)).opts(color=c,size=0) for n,c in color_key})


# Create the plot with datashader
points=hv.Points(df, [x, y],label="%s vs %s" % (x, y),)
datashaded=datashade(points,aggregator=ds.by(color)).opts(width=800, height=480)

(spread(datashaded,px=4, shape='square')*color_points).opts(legend_position='right')

It produces the following picture: picture of the scatterplot

You can see some issues: Most importantly although there is just one person from Paris you see that the NA-person (Charlie) is also printed in purple, the color for Paris. Is there a way to make the dot gray? I have tried many plots and it seems like the NAs always take the color of the last item in the legend.

Then there are some minor issues I have I did not want to open questions for. (If you think they deserve their own question please tell me, I am new to stackoverflow and appreciate your advice.)

One other problem: The dots are not all of the same size. This is quite ugly. Is there a way to change that?

And then there is also a question that I have: Does the datashader internally also use the .cat.categories-method to decide what color to use? How are the colors, that datashader uses, determined? Because I wonder whether the legend is always in correct order (showing the correct colors: If you permute the order in cats then color_key and cats are not in the same order anymore and the legend shows wrong colors). It seems to always work the way I do but I feel a bit insecure.

And maybe someone wants to give their opinion whether Points is okay to use for scatterplots in this case. Because I do not see any difference to Scatter and also semantically there is not really one variable that causes the other (although one might argue that age causes experience in this case, but I am going to plot variables where it is not easy at all to find those kinds of causalities) so it is best to use Points if I understood the documentation https://holoviews.org/reference/elements/bokeh/Points.html correctly.


Solution

  • Most importantly although there is just one person from Paris you see that the NA-person (Charlie) is also printed in purple, the color for Paris. Is there a way to make the dot gray? I have tried many plots and it seems like the NAs always take the color of the last item in the legend.

    Right now I believe Datashader replaces NaNs with zeros (see https://github.com/holoviz/datashader/blob/master/datashader/transfer_functions/__init__.py#L351). Seems like a good feature request to be able to supply Datashader with a color to use for NaNs instead, but in the meantime, I'd recommend replacing the NaNs with an actual category name like "Other" or "Missing" or "Unknown", and then both the coloring and the legend should reflect that name.

    One other problem: The dots are not all of the same size. This is quite ugly. Is there a way to change that?

    Usually Datashader in a Bokeh HoloViews plot will render once initially before it is put into a Bokeh layout, and will be triggered to update once the layout is finished with a final version. Here, the initial rendering is being auto-ranged to precisely the range of the data points, then clipped by the boundaries of the plot (making squares near the edges become rectangles), and then the range of the plot is updated once the legend is added. To see how that works, remove *color_points and you'll see the same shape of dots, but now cropped by the plot edges:

    cropped points

    You can manually trigger an update to the plot by zooming or panning slightly once it's displayed, but to force it to update without needing manual intervention, you can supply an explicit plot range:

    points=hv.Points(df, [x, y],label="%s vs %s" % (x, y),).redim.range(Age=(0,90), Experience=(0,14))
    

    Fixed

    It would be great if you could file a bug report on HoloViews asking why it is not refreshing automatically in this case when the legend is included. Hopefully a simple fix!

    Does the datashader internally also use the .cat.categories-method to decide what color to use? How are the colors, that datashader uses, determined? Because I wonder whether the legend is always in correct order (showing the correct colors: If you permute the order in cats then color_key and cats are not in the same order anymore and the legend shows wrong colors). It seems to always work the way I do but I feel a bit insecure.

    Definitely! Whenever you show a legend, you should be sure to pass in the color key as a dictionary so that you can be sure that the legend and the plot coloring use the same colors for each category. Just pass in color_key={k:v for k,v in color_key} when you call datashade.

    And maybe someone wants to give their opinion whether Points is okay to use for scatterplots in this case. Because I do not see any difference to Scatter and also semantically there is not really one variable that causes the other (although one might argue that age causes experience in this case, but I am going to plot variables where it is not easy at all to find those kinds of causalities) so it is best to use Points if I understood the documentation https://holoviews.org/reference/elements/bokeh/Points.html correctly.

    Points is meant for a 2D location in a plane where both axes are interchangeable, such as the physical x,y location on a geometric plane. Points has two independent dimensions, and expects you to have any dependent dimensions be used for the color, marker shape, etc. It's true that if you don't know which variable might be dependent on the other, a Scatter is tricky to use, but simply choosing to put something on the x axis will set people up to think that that variable is independent, so there's not much that can be done about it. Definitely not appropriate to use Points in this case.