python numba jit

Numba jitclass with a record type containing a string field


The v3 field holds string values. I could not get the code below to run; it fails with the error shown after it.

import numpy as np
import pandas as pd
from numba.experimental import jitclass
from numba import types
import os

os.environ['NUMBA_VERBOSE'] = '1'

# ----- BEGINNING OF THE MODIFIED PART ----- #
recordType = types.Record([
    ('v', {'type': types.int64, 'offset': 0, 'alignment': None, 'title': None}),
    ('v2', {'type': types.float64, 'offset': 8, 'alignment': None, 'title': None}),
    ('v3', {'type': types.bytes, 'offset': 16, 'alignment': None, 'title': None})
], 32, False)
spec = [
    ('data', types.Array(recordType, 1, 'C', False))
]
# ----- END OF THE MODIFIED PART ----- #

@jitclass(spec)
class Test:
    def __init__(self, data):
        self.data = data

    def loop(self):
        v = self.data['v']
        v2 = self.data['v2']
        v3 = self.data['v3']
        print("Inside loop:")
        print("v:", v)
        print("v2:", v2)
        print("v3:", v3)

# Create a dictionary with the data
data = {'v': [1, 2, 3], 'v2': [1.0, 2.0, 3.0], 'v3': ['a', 'b', 'c']}

# Create the DataFrame
df = pd.DataFrame(data)

# Define the structured array dtype
dtype = np.dtype([
    ('v', np.int64),
    ('v2', np.float64),
    ('v3', 'S10')  # Byte string with maximum length of 10 characters
])

print(df.to_records(index=False))

# Create the structured array
data_array = np.array(list(df.to_records(index=False)), dtype=dtype)

print("Original data array:")
print(data_array)

# Create an instance of the Test class
test = Test(data_array)
test.loop()

Errors:

/home/totaljj/miniconda3/bin/conda run -n bt --no-capture-output python /home/totaljj/bt_lite_strategies/test/test_units/test_numba_obj.py 
Traceback (most recent call last):
  File "/home/totaljj/bt_lite_strategies/test/test_units/test_numba_obj.py", line 13, in <module>
    ('v3', {'type': types.bytes, 'offset': 16, 'alignment': None, 'title': None})
AttributeError: module 'numba.core.types' has no attribute 'bytes'
ERROR conda.cli.main_run:execute(124): `conda run python /home/totaljj/bt_lite_strategies/test/test_units/test_numba_obj.py` failed. (See above for error)

Process finished with exit code 1

Solution

  • None of Numba 0.57.1, 0.58.1, or 0.59.1 has a types.bytes type. For the S10 NumPy type, use types.CharSeq(10) instead. The total record size is also wrong: it should be 26 rather than 32, since the string takes 10 bytes and the two other fields take 8 bytes each (with no alignment padding).
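    The size arithmetic can be checked with NumPy alone. This is a minimal sketch using the same dtype as in the question; the `align=True` variant is added here only for comparison, and it suggests one place a size of 32 could come from:

```python
import numpy as np

fields = [('v', np.int64), ('v2', np.float64), ('v3', 'S10')]

# Packed layout (the NumPy default): fields are stored back to back.
packed = np.dtype(fields)

# Aligned layout (comparison only): the record is padded to a multiple of 8 bytes.
aligned = np.dtype(fields, align=True)

# dtype.fields maps each name to a (dtype, byte offset) pair.
field_offsets = {name: packed.fields[name][1] for name in packed.names}

print(field_offsets)     # {'v': 0, 'v2': 8, 'v3': 16}
print(packed.itemsize)   # 26
print(aligned.itemsize)  # 32
```

    The size given to types.Record has to match the itemsize of the array that is actually passed in, which is the packed one here.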

    Here is the modified code:

    import numpy as np
    import pandas as pd
    from numba.experimental import jitclass
    from numba import types
    import os
    
    os.environ['NUMBA_VERBOSE'] = '1'
    
    # ----- BEGINNING OF THE MODIFIED PART ----- #
    recordType = types.Record([
        ('v', {'type': types.int64, 'offset': 0, 'alignment': None, 'title': None}),
        ('v2', {'type': types.float64, 'offset': 8, 'alignment': None, 'title': None}),
        ('v3', {'type': types.CharSeq(10), 'offset': 16, 'alignment': None, 'title': None})
    ], 26, False)
    spec = [
        ('data', types.Array(recordType, 1, 'C', False))
    ]
    # ----- END OF THE MODIFIED PART ----- #
    
    @jitclass(spec)
    class Test:
        def __init__(self, data):
            self.data = data
    
        def loop(self):
            v = self.data['v']
            v2 = self.data['v2']
            v3 = self.data['v3']
            print("Inside loop:")
            print("v:", v)
            print("v2:", v2)
            print("v3:", v3)
    
    # Create a dictionary with the data
    data = {'v': [1, 2, 3], 'v2': [1.0, 2.0, 3.0], 'v3': ['a', 'b', 'c']}
    
    # Create the DataFrame
    df = pd.DataFrame(data)
    
    # Define the structured array dtype
    dtype = np.dtype([
        ('v', np.int64),
        ('v2', np.float64),
        ('v3', 'S10')  # Byte string with maximum length of 10 characters
    ])
    
    print(df.to_records(index=False))
    
    # Create the structured array
    data_array = np.array(list(df.to_records(index=False)), dtype=dtype)
    
    print("Original data array:")
    print(data_array)
    
    # Create an instance of the Test class
    test = Test(data_array)
    test.loop()
    

    Notes

    Note that converting a dataframe to records can be expensive if the dataframe has many columns, since the default internal layout in Pandas (the one used here) is generally a dict of (NumPy) arrays. Records use a transposed layout, which is only good when iterating over each row and reading most of its fields. Besides, records tend to prevent low-level vectorization, that is, the use of SIMD instructions (which can make code a lot faster), though not all code can benefit from that. With few columns, it is often better to use multiple arrays like Pandas does internally (especially when strings are involved). For more information, see material on Structure of Arrays (SoA) vs Array of Structures (AoS).
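
    The layout difference can be seen from the strides of a single field. This is a minimal sketch with NumPy only; the names `aos`, `aos_v2` and `soa_v2` are illustrative:

```python
import numpy as np

dtype = np.dtype([('v', np.int64), ('v2', np.float64), ('v3', 'S10')])

# Array of Structures: each 26-byte record holds all three fields.
aos = np.array([(1, 1.0, b'a'), (2, 2.0, b'b'), (3, 3.0, b'c')], dtype=dtype)

# Selecting one field gives a strided view that jumps over the other fields.
aos_v2 = aos['v2']

# Structure of Arrays: the same column copied into its own contiguous buffer.
soa_v2 = np.ascontiguousarray(aos['v2'])

print(aos_v2.strides, aos_v2.flags['C_CONTIGUOUS'])  # (26,) False
print(soa_v2.strides, soa_v2.flags['C_CONTIGUOUS'])  # (8,) True
```

    The contiguous column is the kind of input that SIMD-friendly loops usually need, while the strided view has to skip 18 unrelated bytes between consecutive values.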