python, dataframe, performance, data-structures

Is there a name for this type of structured data, and what is a more efficient way to use it?


I'll try to make this question simple so that hopefully the concept is applicable to other fields.

I work with seismic datasets, where N earthquakes are recorded by M seismic stations. There is earthquake-specific information (earthquake ID, magnitude, etc.), and station-specific information (station ID, instrument sampling rate, etc.). Not all seismic stations record all earthquakes, so there are a variable number of stations that record each earthquake.

Currently, I organize this data in a Pandas DataFrame object, with one line per earthquake. Since there are a variable number of stations for each earthquake, I store the station IDs as a numpy array. Here is a simple example, with earthquakes 0001 and 0002 of magnitude 2.1 and 3.1, respectively. Four seismic stations, a, b, c, and d, have sampling rates of 100, 100, 60, and 60, respectively.

import numpy as np 
import pandas as pd

events = {
    'event_id': ['0001', '0002'],
    'mag': [2.1, 3.1],
    'station_ids': [np.array(['a', 'c']), np.array(['a', 'b', 'c', 'd'])],
    'station_sampling_rate': [np.array([100, 60]), np.array([100, 100, 60, 60])]
}
events = pd.DataFrame(events)
print(events)

which results in

  event_id  mag   station_ids station_sampling_rate
0     0001  2.1        [a, c]             [100, 60]
1     0002  3.1  [a, b, c, d]    [100, 100, 60, 60]

Now, when I need station-specific information from this DataFrame, I have to either:

  1. Loop through each row of events, perform some check, and compare against an existing catalog of stations (slow), or
  2. Construct events so that it carries more and more station-specific information for each station (events gets really big), and still loop through each row of events.

Method 1 has worked for me so far, but is slow. Method 2 seems like a waste of resources.
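For context, Method 1 looks roughly like the following. This is a minimal sketch; the `station_catalog` dict is a hypothetical stand-in for the existing catalog of stations mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an existing catalog of station metadata
station_catalog = {'a': 100, 'b': 100, 'c': 60, 'd': 60}

events = pd.DataFrame({
    'event_id': ['0001', '0002'],
    'mag': [2.1, 3.1],
    'station_ids': [np.array(['a', 'c']), np.array(['a', 'b', 'c', 'd'])],
})

# Method 1: visit every row and look up each station in the catalog
# (a Python-level double loop -- slow once N gets large)
per_station = {}
for _, row in events.iterrows():
    for sid in row['station_ids']:
        info = per_station.setdefault(
            sid, {'sampling_rate': station_catalog[sid],
                  'event_ids': [], 'mag': []})
        info['event_ids'].append(row['event_id'])
        info['mag'].append(row['mag'])
```
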

The desired structure of the above example is this:

  station_id  sampling_rate     event_ids         mag
0          a            100  [0001, 0002]  [2.1, 3.1]
1          b            100        [0002]       [3.1]
2          c             60  [0001, 0002]  [2.1, 3.1]
3          d             60        [0002]       [3.1]

Is there an efficient way to structure, store, and transform this data? It seems something must already exist for dealing with this kind of data.

I tried searching, but I don't know what this data structure is called, or whether it even has a name.


Solution

  • Looping through rows is intrinsically very slow in Pandas and (almost) never needed. As for a name: your original layout is often called "nested" data (lists inside cells), and exploding it gives the "long" (tidy) format, with one row per station/event pair. For the example given (note that exploding multiple columns at once requires pandas ≥ 1.3):

    events2 = events.explode(['station_ids', 'station_sampling_rate'])
    
    results = events2.groupby(['station_ids', 'station_sampling_rate'], as_index=False)[['mag', 'event_id']].agg(list)
    
    print(results)
    
    

    gives the DF:

      station_ids station_sampling_rate         mag      event_id
    0           a                   100  [2.1, 3.1]  [0001, 0002]
    1           b                   100       [3.1]        [0002]
    2           c                    60  [2.1, 3.1]  [0001, 0002]
    3           d                    60       [3.1]        [0002]