
Custom decode binary data in polars


When working with binary data, I use a custom function to decode it. This requires apply (map_elements in newer polars versions). Because the function is called in Python for every element, the computation time increases significantly on large data sets.

I tried to cast the binary data to List(UInt8), but this is not yet implemented.

exceptions.ArrowErrorException: NotYetImplemented("Casting from LargeBinary to LargeList(Field { name: \"item\", data_type: UInt8, is_nullable: true, metadata: {} }) not supported")

Is there a more efficient way of doing it?

import polars as pl
import struct
import io

data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00',b'\x10\x00\x20\x00\x30\x00'], "id": [1,2]}
schema = {"binary": pl.Binary, "id":pl.Int16}

df = pl.DataFrame(data, schema)

This returns:

shape: (2, 2)
┌───────────────┬─────┐
│ binary        ┆ id  │
│ ---           ┆ --- │
│ binary        ┆ i16 │
╞═══════════════╪═════╡
│ [binary data] ┆ 1   │
│ [binary data] ┆ 2   │
└───────────────┴─────┘

Now when we apply our function to decode the binary column:

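def custom_decode(data):
    # Read the 6-byte payload as three little-endian unsigned 16-bit integers.
    bytestream = io.BytesIO(data)
    lst = []

    while bytestream.tell() < 6:
        lst.append(struct.unpack('<H', bytestream.read(2))[0])

    return lst

# map_elements calls custom_decode once per row, in Python.
df = df.with_columns(
    pl.col('binary').map_elements(custom_decode)
)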

Result:

shape: (2, 2)
┌─────────────────┬─────┐
│ binary          ┆ id  │
│ ---             ┆ --- │
│ list[i64]       ┆ i16 │
╞═════════════════╪═════╡
│ [253, 254, 255] ┆ 1   │
│ [16, 32, 48]    ┆ 2   │
└─────────────────┴─────┘
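
Independent of polars, the per-element function itself can be tightened. The following is only a sketch: decode_u16_le is a hypothetical name, and it assumes every value is a sequence of little-endian unsigned 16-bit integers, as in custom_decode above. It uses struct.iter_unpack from the standard library, so the hard-coded length of 6 and the BytesIO loop are no longer needed. It is still called once per row, so it does not remove the element-wise overhead.

```python
import struct


def decode_u16_le(data: bytes) -> list:
    """Decode a byte string into little-endian unsigned 16-bit integers."""
    # iter_unpack walks the buffer in 2-byte steps and yields 1-tuples.
    # It raises struct.error if len(data) is not a multiple of 2.
    return [value for (value,) in struct.iter_unpack("<H", data)]


print(decode_u16_le(b"\xFD\x00\xFE\x00\xFF\x00"))  # [253, 254, 255]
print(decode_u16_le(b"\x10\x00\x20\x00\x30\x00"))  # [16, 32, 48]
```

It can be passed to map_elements in the same way as custom_decode.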

Solution

  • I have added the cast upstream. Starting with the next polars release (polars>=0.18.1), you can do this:

    data = {"binary": [b'\xFD\x00\xFE\x00\xFF\x00',b'\x10\x00\x20\x00\x30\x00'], "id": [1,2]}
    schema = {"binary": pl.Binary, "id":pl.Int16}
    
    (
        pl.DataFrame(data, schema)
        .with_columns(
            pl.col("binary").cast(pl.List(pl.UInt8))
        )
    )
    
    shape: (2, 2)
    ┌───────────────┬─────┐
    │ binary        ┆ id  │
    │ ---           ┆ --- │
    │ list[u8]      ┆ i16 │
    ╞═══════════════╪═════╡
    │ [253, 0, … 0] ┆ 1   │
    │ [16, 0, … 0]  ┆ 2   │
    └───────────────┴─────┘
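
Note that the cast yields the individual bytes (list[u8]), not the 16-bit values that custom_decode in the question returns. Each pair of bytes still has to be combined as low + 256 * high. The plain-Python sketch below only illustrates that arithmetic; combine_le_u16 is a hypothetical helper, not a polars function, and the little-endian byte order is assumed from the '<H' format used in the question.

```python
def combine_le_u16(byte_values):
    """Combine consecutive (low, high) byte pairs into unsigned 16-bit values."""
    if len(byte_values) % 2 != 0:
        raise ValueError("expected an even number of bytes")
    # Even positions hold the low byte, odd positions the high byte.
    low_bytes = byte_values[0::2]
    high_bytes = byte_values[1::2]
    return [lo | (hi << 8) for lo, hi in zip(low_bytes, high_bytes)]


# The first row after the cast is [253, 0, 254, 0, 255, 0].
print(combine_le_u16([253, 0, 254, 0, 255, 0]))  # [253, 254, 255]
print(combine_le_u16([16, 0, 32, 0, 48, 0]))     # [16, 32, 48]
```

The same pairing would have to be expressed with polars list expressions to keep the whole computation out of Python.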