I am sorry if this is a duplicate, but I didn't find a suitable answer for this problem.
I have a bytes object in Python, like this:
b'\n\x00\x00\x00\x01\x00\x00\x00TEST\xa2~\x08A\x83\x11\xe3@\x05\x00\x00\x00\x03\x00\x00\x00TEST\x91\x9b\xd1?\x1c\xaa,@'
It first contains a certain number of integers (4 bytes each), then a string of 4 characters, and then a certain number of floats (4 bytes each).
This pattern is repeated a certain number of times, and each repetition corresponds to a new row of data. The format of each row is the same and is known in advance. In the example there are 2 rows, each with 2 integers, 1 string and 2 floats.
My question is whether there is a way to convert this kind of data to a pandas DataFrame directly.
My current approach is to first read all values (e.g. with struct.Struct.unpack) and place them in a list of lists. This, however, seems rather slow, especially for a large number of rows.
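For reference, the struct-based approach could look roughly like the sketch below. The format string '<2i4s2f' assumes little-endian data and the example layout of 2 integers, a 4-byte string and 2 floats per row; the names row_format and rows are only illustrative.

```python
import struct

data = b'\n\x00\x00\x00\x01\x00\x00\x00TEST\xa2~\x08A\x83\x11\xe3@\x05\x00\x00\x00\x03\x00\x00\x00TEST\x91\x9b\xd1?\x1c\xaa,@'

# Assumed row layout: little-endian, 2 x int32, 4-byte string, 2 x float32
row_format = struct.Struct('<2i4s2f')

# iter_unpack yields one tuple per row; len(data) must be a multiple of row_format.size
rows = [list(values) for values in row_format.iter_unpack(data)]

print(row_format.size)  # bytes per row
print(rows)
```

The resulting list of lists can then be passed to pandas.DataFrame, but every value is turned into a Python object along the way, which is presumably where the time goes for many rows.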
This works fine for me:
import numpy as np
import pandas as pd

data = b'\n\x00\x00\x00\x01\x00\x00\x00TEST\xa2~\x08A\x83\x11\xe3@\x05\x00\x00\x00\x03\x00\x00\x00TEST\x91\x9b\xd1?\x1c\xaa,@'

# One entry per field of a row, in the order the fields appear in the bytes
dtype = np.dtype([
    ('int1', np.int32),
    ('int2', np.int32),
    ('string', 'S4'),
    ('float1', np.float32),
    ('float2', np.float32),
])

# Interpret the buffer as an array of rows without parsing values one by one
structured_array = np.frombuffer(data, dtype=dtype)
df = pd.DataFrame(structured_array)

# The 'S4' field holds bytes; decode it to str
df['string'] = df['string'].str.decode('utf-8')
print(df)
It gives me the following output:
| | int1 | int2 | string | float1 | float2 |
|---|---|---|---|---|---|
| 0 | 10 | 1 | TEST | 8.530916 | 7.095888 |
| 1 | 5 | 3 | TEST | 1.637560 | 2.697883 |
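One caveat: np.int32 and np.float32 use the native byte order of the machine running the code. If the data is known to be little-endian (an assumption here, although it matches the example bytes), the byte order can be spelled out in the dtype so the result does not depend on the platform. A minimal sketch, which also checks that the buffer length is a whole number of rows:

```python
import numpy as np

data = b'\n\x00\x00\x00\x01\x00\x00\x00TEST\xa2~\x08A\x83\x11\xe3@\x05\x00\x00\x00\x03\x00\x00\x00TEST\x91\x9b\xd1?\x1c\xaa,@'

# Assumption: little-endian data; '<i4' is a 4-byte int, '<f4' a 4-byte float
dtype = np.dtype([
    ('int1', '<i4'),
    ('int2', '<i4'),
    ('string', 'S4'),
    ('float1', '<f4'),
    ('float2', '<f4'),
])

# Each row occupies dtype.itemsize bytes; a trailing partial row would make
# np.frombuffer fail, so check for it explicitly first
if len(data) % dtype.itemsize != 0:
    raise ValueError('buffer length is not a multiple of the row size')

structured_array = np.frombuffer(data, dtype=dtype)
print(structured_array['int1'])
print(structured_array['string'])
```

The structured array can be passed to pd.DataFrame exactly as in the code above.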