python numpy

Axes labels within numpy arrays


Does np.ndarray have the functionality of carrying axes labels?

Let's say that I have a 2-D array with dimensions being time and speed. I want to actually have both axes labels (time and speed values) embedded in an object, so that the object takes care of the axes whenever I do operations with the array (e.g. slice or even plot).

In my field it's common to record from multiple sensors and then segment the data into multiple samples/events. If each sensor captures a 1-dimensional signal across time, the data has dimensions [Sensor, Event, Time] (the dimensionality is implicit in the data itself).

With a plain numpy.ndarray, you end up with four variables: data, a 3-D array with the recorded data; sensor, a 1-D np.recarray with all the information for each sensor (e.g. name, location, ...); event, a 1-D np.recarray with all the information for each sample/event (e.g. type, offset, ...); and time, a vector with the time values.

I want to have all that information in a single object, mydata, and not worry about basic manipulations such as slicing, so that mydata[0:3, 1:10] slices the labels of the corresponding dimensions accordingly.
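To make the desired behaviour concrete, here is a minimal sketch. The class name LabeledArray and its attributes are hypothetical, made up for illustration; it only supports slice indexing and uses plain arrays instead of np.recarray for the labels.

```python
import numpy as np


class LabeledArray:
    """Hypothetical container: a data array plus one label vector per axis."""

    def __init__(self, data, labels):
        data = np.asarray(data)
        labels = [np.asarray(lab) for lab in labels]
        if len(labels) != data.ndim:
            raise ValueError("need exactly one label vector per axis")
        for axis, lab in enumerate(labels):
            if lab.shape != (data.shape[axis],):
                raise ValueError(f"labels for axis {axis} do not match the data")
        self.data = data
        self.labels = labels

    def __getitem__(self, key):
        # Normalise the key to a tuple of slices (the only case handled here).
        if not isinstance(key, tuple):
            key = (key,)
        if not all(isinstance(k, slice) for k in key):
            raise TypeError("this sketch only supports slice indexing")
        # Axes not mentioned in the key are kept whole.
        key = key + (slice(None),) * (self.data.ndim - len(key))
        # Apply the same slice to the data and to each axis' labels.
        new_labels = [lab[k] for lab, k in zip(self.labels, key)]
        return LabeledArray(self.data[key], new_labels)


# Made-up example data: 4 sensors, 12 events, 50 time points.
sensor = np.array(["s0", "s1", "s2", "s3"])
event = np.arange(12)
time = np.linspace(0.0, 1.0, 50)
data = np.zeros((4, 12, 50))

mydata = LabeledArray(data, [sensor, event, time])
mydata1 = mydata[0:3, 1:10]
print(mydata1.data.shape)  # (3, 9, 50)
print(mydata1.labels[0])   # labels of the three selected sensors
```

Integer and fancy indexing, which would drop or reorder axes, are deliberately left out of this sketch.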

I agree that things like plotting will be data specific, but I'll happily code a subclass of such an object with some extra functions (e.g. plot).

Why would this be useful?

Readability: Compare

data1 = data[0:3, 1:10]
sensor1 = sensor[0:3]
event1 = event[1:10]
time1 = time

with a simple

mydata1 = mydata[0:3, 1:10]

Maintenance: The second option is clearly easier to maintain and less error-prone, since all the associated variables are sliced together.

Convenience: Having all this information in one place makes it possible to integrate useful and powerful functions within the class. For example, if I create a derived class for time series (forcing it to have a time axis), I can run time-specific functions without having to specify the time or sampling frequency (as this information is within the object itself). The idea is to have a base class carrying the axes' labels, with specific subclasses arising naturally when necessary (e.g. one for time series, one for video, one for topographic information, etc.), each incorporating specialized functionality.

Close but not exactly

As @user2357112 mentioned, Pandas' DataFrame is close to what I'm looking for. But, apart from the fact that its support for N-D arrays is still experimental, it seems too oriented towards table-like behaviour (from what I've read so far), e.g. treating the first dimension differently from the others (items vs columns).

Is it worth it?

The above may seem trivial and not worth the effort, but I programmed a subclass of np.ndarray with such functionality a few years ago, and I can assure you it made my life and code so much easier! (The specific application was similar to the example above: [sensor, sample, time].) But that was back when I was learning Python, and the way I coded it isn't what you'd call pretty. It also has some fundamental faults, like the axes labels not following the same shared-memory rules as np.ndarray.

Before going to the trouble of rewriting this thing and making it public, I wanted to know if there's something similar out there.


Solution

  • What you may be looking for is xarray.


    From its documentation:

    xarray: N-D labeled arrays and datasets in Python

    xarray (formerly xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

    Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.

    Xarray was inspired by and borrows heavily from pandas, the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray’s data model, and integrates tightly with dask for parallel computing.
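As a rough illustration of how the [sensor, event, time] example from the question might look with xarray, here is a minimal sketch. The sensor names, locations, array sizes and random data are made up for illustration; extra per-sensor information is attached as a non-dimension coordinate rather than an np.recarray.

```python
import numpy as np
import xarray as xr

# Made-up data: 4 sensors, 12 events, 50 time points.
rng = np.random.default_rng(0)
data = rng.standard_normal((4, 12, 50))

mydata = xr.DataArray(
    data,
    dims=("sensor", "event", "time"),
    coords={
        "sensor": ["s0", "s1", "s2", "s3"],
        "event": np.arange(12),
        "time": np.linspace(0.0, 1.0, 50),
        # Extra per-sensor information, aligned with the "sensor" axis.
        "location": ("sensor", ["front", "back", "left", "right"]),
    },
)

# Positional slicing, as in the question: the coordinates follow the data.
mydata1 = mydata[0:3, 1:10]
print(mydata1.shape)                    # (3, 9, 50)
print(mydata1.coords["sensor"].values)  # labels of the selected sensors

# The same selection by dimension name instead of axis position.
by_name = mydata.isel(sensor=slice(0, 3), event=slice(1, 10))

# Label-based selection: one sensor, first half of the time axis.
by_label = mydata.sel(sensor="s1", time=slice(0.0, 0.5))
print(by_label.dims)                    # ('event', 'time')
```

Selecting by dimension name with isel or sel avoids having to remember the axis order. DataArray also has a plot method, which needs matplotlib installed.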