pythonpandasnumpymatrixtensor

Tensor (5-dim matrix) in python pandas?


I have a CSV file that looks like

time Col_A Col_B Col_C Col_D Price
123 A1 B1 C2 D6 23.43
124 A5 B3 C7 D1 14.63
125 A3 B2 C3 D2 343.43
126 A2 B1 C2 D6 43.43
127 A1 B1 C7 D2 6.63

Now I want to create a 5-dimensional matrix. I think it is called a tensor?

Question 1) Can this tensor be done in Pandas, or is there a better Python library? Do you have a link to an API manual so I can read more?

The dimensions of the tensor will be "time", "Col_A", "Col_B", "Col_C", "Col_D". The cell will contain the scalar "Price".

So I want to be able to set the cell, maybe something like:

my_matrix[time=123, Col_A = A1, Col_B = B1, Col_C = C2, Col_D = D6] = 23.43

Question 2) So what is the syntax to set a scalar? And how do I read the scalar?

I would also like to sum over dimensions. Say I want to sum like this below. When I write "*" I mean the star is a wildcard.

Matrix[time = *, Col_A = A1, Col_B = B1 and B2 and B3, Col_C = *, Col_D = D6 and D2]

Question 3) How do I sum over different dimensions? Do I need to for loop? Can I use other operations, such as divisions?

Actually, I want to put several numbers in each cell. Maybe I would like to put "time" and "Price" into each cell. I was thinking about string concatenation "time_Price", but to do a large summation, there will be lot of substring extraction, which might be time consuming? Therefore, I was thinking of creating several identical tensors. One tensor might contain the "Price" in each cell, and another tensor might contain "time" in each cell. So if I want to check the time for a Price, I will use the same coordinates in both tensors.

Question 4) I guess it is faster to use several tensors where each cell contain a single value, instead of using one single tensor containing many variables i.e. long string of concatenated variables? Or is there a better way?


Solution

  • It seems that a dataframe is already an appropriate structure for your data. To answer your questions I'll use your sample data, indexed by the Col_* columns:

    # set up df
    df = pd.DataFrame({
        'time': [123, 124, 125, 126, 127],
        'Col_A': ['A1', 'A5', 'A3', 'A2', 'A1'],
        'Col_B': ['B1', 'B3', 'B2', 'B1', 'B1'],
        'Col_C': ['C2', 'C7', 'C3', 'C2', 'C7'],
        'Col_D': ['D6', 'D1', 'D2', 'D6', 'D2'],
        'Price': [23.43, 14.63, 343.43, 43.43, 6.63]
    })
    
    # set the index to the `Col_*` columns and sort for performance
    df = df.set_index(['Col_A','Col_B','Col_C','Col_D']).sort_index()
    

    Now we have a dataframe where the Col_* values are the indexes and the time and Price values are in the "cells". This is basically the answer to Question 1.

                             time   Price
    Col_A Col_B Col_C Col_D
    A1    B1    C2    D6      123   23.43
                C7    D2      127    6.63
    A2    B1    C2    D6      126   43.43
    A3    B2    C3    D2      125  343.43
    A5    B3    C7    D1      124   14.63
    

    For Question 2 you can just use the index to set the "cell" value:

    df.loc[('A1','B1','C2','D6'), 'Price'] = 55
    

    or both "cells" at once:

    df.loc[('A1','B1','C7','D2'), ['time', 'Price']] = [100, 99]
    

    or with a condition:

    df.loc[(df.index == ('A2','B1','C2','D6')) & (df['time'] == 126), 'Price'] = 44
    

    Output:

                             time   Price
    Col_A Col_B Col_C Col_D
    A1    B1    C2    D6      123   55.00
                C7    D2      100   99.00
    A2    B1    C2    D6      126   44.00
    A3    B2    C3    D2      125  343.43
    A5    B3    C7    D1      124   14.63
    

    For Question 3, use indexing to access the values you want; with slice(None) for the wildcard:

    df.loc[('A1', ['B1','B2','B3'], slice(None), ['D2','D6'])].sum()
    

    Output:

    time     223.0
    Price    154.0
    dtype: float64
    

    In terms of Question 4, inside the dataframe you can have as many "cell" values as you like. As shown above, you can access the "cell" values individually or together as needed. For example, let's say you wanted address and gender information. Then you could set a few of those values like this:

    # set an existing entry
    df.loc[('A1','B1','C7','D2'), ['Address', 'Gender']] = ['1 Main St', 'male']
    
    # create a new entry with all 4 values
    # (time, price, address, gender)
    df.loc[('A4', 'B4', 'C4', 'D4')] = [130, 45.3, 'High St', 'female']
    
    # create a new entry with only 3 of the 4 values
    df.loc[('A1', 'B2', 'C3', 'D4'), ['time', 'Price', 'Address']] = [120, 51.4, '4 Brown St']
    

    Starting from the original dataframe, this gives:

                  time   Price     Address  Gender
    A1 B1 C2 D6  123.0   23.43         NaN     NaN
          C7 D2  127.0    6.63   1 Main St    male
    A2 B1 C2 D6  126.0   43.43         NaN     NaN
    A3 B2 C3 D2  125.0  343.43         NaN     NaN
    A5 B3 C7 D1  124.0   14.63         NaN     NaN
    A4 B4 C4 D4  130.0   45.30     High St  female
    A1 B2 C3 D4  120.0   51.40  4 Brown St     NaN
    

    Note that for performance reasons, you should use df = df.sort_index() after adding new entries.

    You can also search this dataframe easily. For example, find all rows with Price > 50:

    df[df['Price'] > 50]
    

    Output:

                  time   Price     Address Gender
    A3 B2 C3 D2  125.0  343.43         NaN    NaN
    A1 B2 C3 D4  120.0   51.40  4 Brown St    NaN
    

    or time < 125 and no gender specified:

    df[(df['time'] < 125) & df['Gender'].isna()]
    

    Output:

                  time  Price     Address Gender
    A1 B1 C2 D6  123.0  23.43         NaN    NaN
    A5 B3 C7 D1  124.0  14.63         NaN    NaN
    A1 B2 C3 D4  120.0  51.40  4 Brown St    NaN