python, performance, matplotlib, scientific-computing

Interactive large plot with ~20 million sample points and gigabytes of data


I've got a problem (with my RAM) here: it isn't able to hold the data I want to plot. I do have sufficient HD space. Is there any solution to avoid that "shadowing" of my data set?

Concretely, I work with digital signal processing and have to use a high sample rate. My framework (GNU Radio) saves the values in binary (to avoid using too much disk space), and I unpack them before plotting. The plot needs to be zoomable and interactive, and that is where the problem lies.

Is there any potential for optimization here, or another piece of software / programming language (R, for instance) that can handle larger data sets? Actually I want much more data in my plots, but I have no experience with other software. gnuplot fails with an approach similar to the following, and I don't know R (yet).

import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
import struct

"""
plots a cfile

cfile - IEEE single-precision (4-byte) floats, IQ pairs, binary
txt - index,in-phase,quadrature in plaintext

note: directly plotting with numpy results in shadowed functions
"""

# unpacking the cfile dataset
def unpack_set(input_filename, output_filename):
    index = 0   # index of the samples
    output_filename = open(output_filename, 'wb')

    with open(input_filename, "rb") as f:

        byte = f.read(4)    # read 1. column of the vector

        while byte != "":
            # stored Bit Values
            floati = struct.unpack('f', byte)[0]    # write value of 1. column to a variable
            byte = f.read(4)                        # read 2. column of the vector
            floatq = struct.unpack('f', byte)[0]    # write value of 2. column to a variable
            byte = f.read(4)                        # next row of the vector and read 1. column
            # delimiter format for matplotlib
            lines = ["%d," % index, format(floati), ",", format(floatq), "\n"]
            output_filename.writelines(lines)
            index = index + 1
    output_filename.close()
    return output_filename.name

# reformats output (precision configuration here)
def format(value):
    return "%.8f" % value            

# start
def main():

    # specify path
    unpacked_file = unpack_set("test01.cfile", "test01.txt")
    # pass file reference to matplotlib
    fname = str(unpacked_file)
    plt.plotfile(fname, cols=(0,1)) # index vs. in-phase

    # optional
    # plt.axes([0, 0.5, 0, 100000]) # for 100k samples
    plt.grid(True)
    plt.title("Signal-Diagram")
    plt.xlabel("Sample")
    plt.ylabel("In-Phase")

    plt.show()

if __name__ == "__main__":
    main()

Something like plt.swap_on_disk() could cache the stuff on my SSD ;)


Solution

  • So your data isn't that big, and the fact that you're having trouble plotting it points to issues with the tools. Matplotlib has lots of options and the output is fine, but it's a huge memory hog and it fundamentally assumes your data is small. But there are other options out there.

    So as an example, I generated a 20M data-point file 'bigdata.bin' using the following:

    #!/usr/bin/env python
    import numpy
    
    npts=20000000
    filename='bigdata.bin'
    
    def main():
        data = (numpy.random.uniform(0,1,(npts,3))).astype(numpy.float32)
        data[:,2] = 0.1*data[:,2]+numpy.exp(-((data[:,1]-0.5)**2.)/(0.25**2))
        fd = open(filename,'wb')
        data.tofile(fd)     # raw binary dump (scipy.io.numpyio.fwrite has since been removed from scipy)
        fd.close()
    
    if __name__ == "__main__":
        main()
    

    This generates a file of size ~229MB (20M points x 3 columns x 4 bytes each), which isn't all that big; but you've expressed that you'd like to go to even larger files, so you'll hit memory limits eventually.

    Let's concentrate on non-interactive plots first. The first thing to realize is that vector plots with glyphs at each point are going to be a disaster: for each of the 20M points, most of which overlap anyway, rendering little crosses or circles or something generates huge files and takes tonnes of time. This, I think, is what is sinking matplotlib by default.

    Gnuplot has no trouble dealing with this:

    gnuplot> set term png
    gnuplot> set output 'foo.png'
    gnuplot> plot 'bigdata.bin' binary format="%3float32" using 2:3 with dots
    

    (gnuplot output)
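
    If you would rather stay in Python, the same gnuplot commands can be driven from a script as well. Here is a minimal sketch, assuming the gnuplot executable is on your PATH and the same 'bigdata.bin' layout as above:

    #!/usr/bin/env python
    # Minimal sketch: drive gnuplot from Python by piping commands to it.
    # Assumes the 'gnuplot' executable is on the PATH and that 'bigdata.bin'
    # has the 3 x float32 record layout generated above.
    import subprocess
    
    commands = """
    set term png
    set output 'foo.png'
    plot 'bigdata.bin' binary format="%3float32" using 2:3 with dots
    """
    
    proc = subprocess.Popen(['gnuplot'], stdin=subprocess.PIPE)
    proc.communicate(commands.encode())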

    And even Matplotlib can be made to behave with some caution (choosing a raster back end, and using pixels to mark points):

    #!/usr/bin/env python
    import numpy
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt
    
    datatype=[('index',numpy.float32), ('floati',numpy.float32), 
            ('floatq',numpy.float32)]
    filename='bigdata.bin'
    
    def main():
        data = numpy.memmap(filename, datatype, 'r')    # memory-map instead of loading everything into RAM
        plt.plot(data['floati'], data['floatq'], 'r,')  # ',' marker = single pixels, no glyphs
        plt.grid(True)
        plt.title("Signal-Diagram")
        plt.xlabel("Sample")
        plt.ylabel("In-Phase")
        plt.savefig('foo2.png')
    
    if __name__ == "__main__":
        main()  
    

    (matplotlib output)

    Now, if you want an interactive plot, you're going to have to bin the data you draw and re-bin on the fly as you zoom in. I don't know of any Python tools that will help you do this offhand, but the general idea can be sketched.
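
    Just to illustrate what that on-the-fly binning could look like in matplotlib itself (this is only a sketch, not an existing tool; the ~2000-points-per-view target, file name and column names are assumptions carried over from the example above): memory-map the file, draw only a strided subset of the visible range, and recompute that subset whenever the x-limits change.

    #!/usr/bin/env python
    # Rough sketch of interactive decimation: redraw a strided subset of the
    # memory-mapped data whenever the user zooms or pans (x-limits change).
    # The ~2000-points-per-view target and the column names are assumptions.
    import numpy
    import matplotlib.pyplot as plt
    
    datatype = [('index', numpy.float32), ('floati', numpy.float32),
                ('floatq', numpy.float32)]
    data = numpy.memmap('bigdata.bin', datatype, 'r')
    npts = len(data)
    max_pts = 2000                              # roughly one point per horizontal pixel
    
    fig, ax = plt.subplots()
    line, = ax.plot([], [], 'r,')
    
    def redraw(axes):
        lo, hi = axes.get_xlim()
        lo, hi = max(int(lo), 0), min(int(hi), npts)
        step = max((hi - lo) // max_pts, 1)     # stride so only ~max_pts points are drawn
        xs = numpy.arange(lo, hi, step)         # sample numbers of the drawn subset
        line.set_data(xs, data['floati'][lo:hi:step])
        fig.canvas.draw_idle()
    
    ax.set_xlim(0, npts)
    redraw(ax)
    ax.relim()                                  # pick up a y-range from the decimated data
    ax.autoscale_view(scalex=False)
    ax.callbacks.connect('xlim_changed', redraw)
    plt.show()

    A plain stride like this can miss narrow spikes; a more careful version would keep the minimum and maximum of each bin instead, but the callback structure stays the same.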

    On the other hand, plotting big data is a pretty common task, and there are tools that are up for the job. ParaView is my personal favourite, and VisIt is another one. They are both mainly for 3D data, but ParaView in particular does 2D as well and is very interactive (it even has a Python scripting interface). The only trick will be to write the data into a file format that ParaView can easily read.
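
    For instance, ParaView ships with a plain CSV reader, so one option (the column names and the chunk size below are arbitrary choices, not anything ParaView requires) is to convert the binary records to CSV in chunks, so the whole set never has to sit in RAM:

    #!/usr/bin/env python
    # Minimal sketch: convert the binary float32 records into a CSV file that
    # ParaView's CSV reader can open, written in chunks to keep RAM use low.
    # Column names and the chunk size are arbitrary choices.
    import numpy
    
    datatype = [('index', numpy.float32), ('floati', numpy.float32),
                ('floatq', numpy.float32)]
    data = numpy.memmap('bigdata.bin', datatype, 'r')
    chunk = 1000000
    
    with open('bigdata.csv', 'w') as f:
        f.write('index,floati,floatq\n')            # header row with column names
        for start in range(0, len(data), chunk):
            part = data[start:start + chunk]
            cols = numpy.column_stack([part['index'], part['floati'], part['floatq']])
            numpy.savetxt(f, cols, fmt='%.8f', delimiter=',')

    ParaView can then open bigdata.csv directly and plot the columns against each other, or turn the table into points for its 3D views.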