Tags: python, scatter-plot, holoviews, datashader, hvplot

How to create an interactive graph on a large data set?


I am trying to create an interactive graph using HoloViews on a large data set. Below is a sample of the data file, called trackData.csv:

Event         Time             ID     Venue    
Javeline      11:25:21:012345  JVL    Dome
Shot pot      11:25:22:778929  SPT    Dome
4x4           11:25:21:993831  FOR    Track
4x4           11:25:22:874293  FOR    Track
Shot pot      11:25:21:087822  SPT    Dome
Javeline      11:25:23:878792  JVL    Dome
Long Jump     11:25:21:892902  LJP    Aquatic
Long Jump     11:25:22:799422  LJP    Aquatic

This is how I read the data and plot a scatter plot.

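import pandas as pd
import holoviews as hv
hv.extension('bokeh')  # assumed backend; Bokeh provides the interactive zooming

trackData = pd.read_csv('trackData.csv')
scatter = hv.Scatter(trackData, 'Time', 'ID')
scatter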

Because this data set is huge, zooming in and out of the scatter plot is very slow, and I would like to speed it up. I did some research and found HoloViews' decimate operation, which is recommended for large datasets, but I don't know how to use it with the code above. Most of what I tried throws an error. Also, is there a way to make sure the Time column is converted to microseconds? Thanks in advance for the help.


Solution

  • Datashader indeed does not handle categorical axes as used here, but that's not so much a limitation of the software as of my imagination -- what should it be doing with them? A Datashader scatterplot (Canvas.points) is meant for a very large number of points located on a continuously indexed 2D plane. Such a plot approximates a 2D probability distribution function, accumulating points per pixel to show the density in that region, and revealing spatial patterns across pixels.

    A categorical axis doesn't have the same properties that a continuous numerical axis does, because there's no spatial relationship between adjacent values. Specifically in this case, there's no apparent meaning to an ordering of the ID field (it appears to be a letter code for a sporting event type), so I can't see any meaning to accumulating across ID values per pixel the way Datashader is designed to do. Even if you convert IDs to numbers, you'll either just get random-looking noise (if there are more ID values than vertical pixels), or a series of spotty lines (if there are fewer ID values than pixels).

    Here, maybe there are only a few dozen or so unique ID values, but many, many time measurements? In that case most people would use a box, violin, histogram, or ridge plot per ID, to see the distribution of values for each ID value. A Datashader points plot is a 2D histogram, but if one axis is categorical you're really dealing with a set of 1D histograms, not a single combined 2D histogram, so just use histograms if that's what you're after.

    If you really do want to try plotting all the points per ID as raw points, you could do that using vertical spike events, as in https://examples.pyviz.org/iex_trading/IEX_stocks.html. You can also add some vertical jitter and then use Datashader, but that's not something directly supported right now, and it doesn't have the clear mathematical interpretation that a normal Datashader plot does (in terms of approximating a density function).
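To make the per-ID histogram suggestion above concrete, here is a minimal sketch using only pandas and NumPy. It also covers the microseconds part of the question. The comma-separated layout, the fixed six-digit fraction in Time, and the helper names time_to_micros and histograms_per_id are my assumptions for illustration, not anything from the original data or from HoloViews.

```python
import io

import numpy as np
import pandas as pd

# A few rows shaped like the question's sample.
# Assumption: the real file is comma-separated.
SAMPLE = """Event,Time,ID,Venue
Javeline,11:25:21:012345,JVL,Dome
Shot pot,11:25:22:778929,SPT,Dome
4x4,11:25:21:993831,FOR,Track
4x4,11:25:22:874293,FOR,Track
Shot pot,11:25:21:087822,SPT,Dome
Javeline,11:25:23:878792,JVL,Dome
Long Jump,11:25:21:892902,LJP,Aquatic
Long Jump,11:25:22:799422,LJP,Aquatic
"""


def time_to_micros(times: pd.Series) -> pd.Series:
    """Convert 'HH:MM:SS:ffffff' strings to integer microseconds since midnight.

    Assumes the last field always has exactly six digits (hypothetical helper).
    """
    parts = times.str.split(":", expand=True).astype("int64")
    hours, minutes, seconds, micros = (parts[i] for i in range(4))
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000 + micros


def histograms_per_id(df: pd.DataFrame, bins: int = 4) -> dict:
    """Return {ID: (counts, edges)} with bin edges shared across all IDs."""
    edges = np.linspace(df["micros"].min(), df["micros"].max(), bins + 1)
    return {
        key: np.histogram(group["micros"], bins=edges)
        for key, group in df.groupby("ID")
    }


track_data = pd.read_csv(io.StringIO(SAMPLE))
track_data["micros"] = time_to_micros(track_data["Time"])
per_id = histograms_per_id(track_data)
```

Because the work is reduced to a handful of bin counts per ID, the number of plotted elements no longer grows with the number of rows. Each (counts, edges) pair could then be handed to a histogram element in your plotting library of choice.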

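For the vertical jitter idea mentioned above, a rough sketch might look like the following. The data is synthetic, the IDs and time range are made up, and np.histogram2d stands in for the per-pixel counting that Datashader would do, so treat this as an illustration of the approach rather than a supported Datashader workflow.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: hypothetical IDs and times in microseconds since midnight.
n = 10_000
df = pd.DataFrame({
    "ID": rng.choice(["JVL", "SPT", "FOR", "LJP"], size=n),
    "micros": rng.integers(41_121_000_000, 41_124_000_000, size=n),
})

# Map each ID to an integer row, then spread points within +/-0.4 of that row
# so that neighbouring categories never overlap.
codes, labels = pd.factorize(df["ID"], sort=True)
df["y"] = codes + rng.uniform(-0.4, 0.4, size=n)

# Count points per cell of a width x height grid. This roughly mimics
# per-pixel aggregation, done here with NumPy purely for illustration.
width, height = 200, 80
counts, x_edges, y_edges = np.histogram2d(
    df["micros"],
    df["y"],
    bins=(width, height),
    range=(
        (df["micros"].min(), df["micros"].max()),
        (-0.5, len(labels) - 0.5),
    ),
)
```

Each ID ends up as a horizontal band of counts. As noted above, the vertical position within a band is just noise, so only the variation along the time axis carries meaning.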