Storing data not aligned to bytes without padding in Python (NumPy)

I have image data represented as 10- or 12-bit integers, and I would like to save it to disk without writing the 6 or 4 bits of zero padding that come with storing each value in a 16-bit integer. Ideally, all of this in Python.

To be more specific: if I have two 12-bit integers stored as 16-bit integers, they take 32 bits of space, but the data itself only covers 24 bits. If I were able to store them as three 8-bit integers, no space would be wasted on padding. The only other thing needed is to remember the shape of the original input array (the image resolution) and the bit depth, so that the original data can be restored.
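To make the idea concrete, here is a minimal pure-Python sketch of packing two 12-bit values into 3 bytes and recovering them. The sample values, and the choice of putting the first value in the high 12 bits, are arbitrary illustrations.

```python
# Two 12-bit samples that would normally each occupy a 16-bit slot (4 bytes total)
a, b = 0xABC, 0x123

# Concatenate them into one 24-bit number: a in the high 12 bits, b in the low 12 bits
packed = (a << 12) | b  # 0xABC123

# 24 bits fit exactly in 3 bytes
raw = packed.to_bytes(3, "little")

# Restoring: read the 3 bytes back and split the 24 bits with a shift and a mask
restored = int.from_bytes(raw, "little")
a2, b2 = restored >> 12, restored & 0xFFF
```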

The source of the data produces NumPy ndarrays with the np.int16 dtype, with the most significant bits set to zero as padding. The destination is an HDF5 file, manipulated with the h5py module, which also uses NumPy ndarrays as its underlying data type. Therefore, I would prefer the operation to be done with NumPy itself, without the additional overhead of casting to a different data type. But any suggested solution is welcome.

I was not able to find any solution to this problem that is even decently efficient. Using simple bit-masking operations to split and recombine the numbers is probably not feasible, since it would cost more CPU time than it would save in disk-writing time. This assumption comes from my expectation that the operation would have to be done element by element, without any vectorization or other optimization.

Therefore, some optimized function is probably necessary, but I haven't found any. I am dealing with video data at hundreds of megabytes per second, so this optimization would make it much easier for my SSD to keep up.

Solution

I don't know if this is very efficient, or if it is what you are looking for. But you could simply flatten the array into pairs of pixels and do some basic shifting and masking (or, equivalently, multiplying by 4096 and adding) to create a np.uint8 array that you can store with any method you already use for arrays, and then do the reverse to recreate your "16 bits representing 12 bits" array.

It is certainly not very efficient, but at least it is vectorized. It is not "element-wise", at least if by that you meant "a pure Python loop performing a pure Python operation on each element". It is still element-wise in the sense that every element is processed, but the iterations happen inside NumPy. I don't see how it could be otherwise anyway: you have to read every element at least once to do anything with it.

Like this

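import numpy as np
def storable8from12(arr):
    # Two 12-bit pixels per uint32: first pixel in bits 12-23, second pixel in bits 0-11
    tmp32 = arr.reshape(-1, 2).astype(np.uint32)
    # View as bytes and keep the 3 low bytes of each uint32 (assumes a little-endian machine)
    return ((tmp32[:, 0] << 12) + tmp32[:, 1]).view(np.uint8).reshape(-1, 4)[:, :3].reshape(len(arr), -1)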

This creates some temporary arrays even bigger than the one you want to shrink, but there is no for loop. And I assume it is the size in storage, not in memory, that you want to reduce.
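If the uint32 temporaries or the reliance on a little-endian machine are a concern, the three bytes of each pair can also be computed directly with shifts and masks on uint16 values. This is a hypothetical variant (the name `pack12_bytewise` is not part of the answer); it is meant to give the same byte layout that storable8from12 produces on a little-endian machine, and it assumes the row length is even.

```python
import numpy as np


def pack12_bytewise(arr):
    """Hypothetical variant of storable8from12 that writes the 3 bytes directly.

    Assumes a 2-D array of 12-bit values whose row length is even. The layout
    matches storable8from12 on a little-endian machine, but this version does
    not depend on the machine's byte order and uses only uint16 intermediates.
    """
    pairs = arr.reshape(-1, 2).astype(np.uint16)
    a = pairs[:, 0]  # first pixel of each pair: high 12 bits of the 24-bit group
    b = pairs[:, 1]  # second pixel of each pair: low 12 bits
    out = np.empty((pairs.shape[0], 3), dtype=np.uint8)
    out[:, 0] = b & 0xFF                               # low 8 bits of b
    out[:, 1] = ((b >> 8) & 0x0F) | ((a & 0x0F) << 4)  # high 4 bits of b, low 4 bits of a
    out[:, 2] = a >> 4                                 # high 8 bits of a
    return out.reshape(arr.shape[0], -1)
```

The output still holds 3 bytes per 2 pixels, so a W×H image becomes a 1.5W×H uint8 array, as in the answer.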

Then, the other way around:

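def unpack12bitsFrom8(st):
    # Append a zero 4th byte to each 3-byte group and reinterpret it as a uint32 (little-endian assumed)
    tmp32 = np.pad(st.reshape(-1, 3), ((0, 0), (0, 1))).view(np.uint32)
    odd = (tmp32 & 0xfff000) >> 12  # first pixel of each pair
    even = tmp32 & 0xfff            # second pixel of each pair
    # Placing the two columns side by side restores pixel order; then restore the rows
    return np.hstack([odd, even]).reshape(len(st), -1).astype(np.uint16)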
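The stored bytes can likewise be decoded with plain shifts on uint16 values, without padding to 4 bytes and viewing as uint32. A sketch with the hypothetical name `unpack12_bytewise`, assuming the byte layout that storable8from12 produces on a little-endian machine (low byte of the second pixel first, high 8 bits of the first pixel last):

```python
import numpy as np


def unpack12_bytewise(st):
    """Hypothetical decoder for 3-bytes-per-2-pixels data.

    Assumed layout of each 3-byte group (a = first pixel, b = second pixel):
      byte0 = low 8 bits of b
      byte1 = high 4 bits of b, plus low 4 bits of a in the upper nibble
      byte2 = high 8 bits of a
    """
    g = st.reshape(-1, 3).astype(np.uint16)
    a = (g[:, 2] << 4) | (g[:, 1] >> 4)    # first pixel of the pair
    b = ((g[:, 1] & 0x0F) << 8) | g[:, 0]  # second pixel of the pair
    # Interleave a and b back into pixel order, then restore the rows
    return np.column_stack([a, b]).reshape(st.shape[0], -1)
```

Because it only does arithmetic on byte values, the result does not depend on the byte order of the machine doing the decoding.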

So, again, I am not sure this is what you were expecting, since you obviously already thought of bitwise manipulation (<<, & and so on). But I don't see why you say it is not "vectorized" (in the sense we usually give this word in NumPy: the for loops are done inside NumPy).

Note that here I am assuming the row length is such that there is no exception at the border. In the case dealt with here ("16 bits carrying 12 bits -> 8 bits -> 16 bits carrying 12 bits"), that means the row length must be even. Otherwise, it is still possible by padding the last pixel of each row, or even by doing the padding globally on the whole image. But then you probably need to store the shape (in my case I don't need to, because I assume that a W×H "16 carrying 12" image translates into a 1.5W×H 8-bit image).
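The per-row padding option could look like the sketch below. The helper names `pack12_padded` and `unpack12_padded` are hypothetical; the original shape is returned so it can be stored next to the data (for example as an attribute of the HDF5 dataset). The sketch also converts through an explicit little-endian `'<u4'` dtype, an addition to the answer's code, so the byte layout does not depend on the machine.

```python
import numpy as np


def pack12_padded(img):
    """Pack a (H, W) array of 12-bit values into 3 bytes per 2 pixels, for any W.

    When W is odd, one zero pixel is appended to every row so that pixels pair
    up within rows. Returns the packed bytes and the original shape.
    """
    h, w = img.shape
    if w % 2:
        img = np.pad(img, ((0, 0), (0, 1)))  # one zero pixel of padding per row
    tmp32 = img.reshape(-1, 2).astype(np.uint32)
    # First pixel in bits 12-23, second in bits 0-11; '<u4' fixes the byte order
    packed24 = ((tmp32[:, 0] << 12) | tmp32[:, 1]).astype('<u4')
    out = packed24.view(np.uint8).reshape(-1, 4)[:, :3]  # drop the always-zero 4th byte
    return out.reshape(h, -1), (h, w)


def unpack12_padded(packed, shape):
    """Inverse of pack12_padded: decode the pairs, then crop the padding column."""
    h, w = shape
    groups = np.pad(packed.reshape(-1, 3), ((0, 0), (0, 1)))  # 3 bytes -> 4 bytes
    tmp32 = groups.view('<u4')[:, 0]  # read each 4-byte group as little-endian uint32
    first = tmp32 >> 12
    second = tmp32 & 0xFFF
    pixels = np.column_stack([first, second]).astype(np.uint16)
    return pixels.reshape(h, -1)[:, :w]
```

For a 3×5 image this yields a 3×9 uint8 array (the padded width 6 times 1.5), and `unpack12_padded` needs the stored shape `(3, 5)` to remove the extra column.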

I haven't covered the 10-bit case. If you don't want to simply waste 2 bits and handle it like the 12-bit case, you'll need to group values into packets of the least common multiple of 10 and 8, which is 40 bits. So it is roughly the same technique, but a bit more painful: you need a 64-bit temporary integer, packing 4 numbers into it instead of 2, and on the way back you need 4 columns to hstack instead of just odd and even. This means that this time the width W has to be a multiple of 4. Or, again, you will need some padding, and to store the shape, if you can't assume that a W×H 10-bit image is stored as a (1.25W)×H 8-bit image.
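A minimal sketch of that 10-bit case, assuming the row width is a multiple of 4. The names `pack10` and `unpack10` are hypothetical, and putting the first value of each group in the highest 10 bits is an arbitrary choice mirroring the 12-bit code. Four values are combined in a uint64, only the 5 low bytes (40 bits) are kept, and an explicit little-endian `'<u8'` dtype keeps the byte layout machine-independent.

```python
import numpy as np

# Bit offsets of the 4 values inside each 40-bit group (first value highest)
_SHIFTS10 = np.array([30, 20, 10, 0], dtype=np.uint64)


def pack10(img):
    """Pack a (H, W) array of 10-bit values into 5 bytes per 4 pixels (W % 4 == 0)."""
    q = img.reshape(-1, 4).astype(np.uint64)
    # Shift each value to its offset and OR the 4 columns into one 40-bit number
    v = np.bitwise_or.reduce(q << _SHIFTS10, axis=1)
    # Little-endian bytes: the first 5 of every 8 hold the 40 meaningful bits
    return v.astype('<u8').view(np.uint8).reshape(-1, 8)[:, :5].reshape(img.shape[0], -1)


def unpack10(st):
    """Inverse of pack10: 5 bytes -> 4 values, returned as uint16."""
    g = np.pad(st.reshape(-1, 5), ((0, 0), (0, 3)))  # pad each group to 8 bytes
    v = g.view('<u8')  # shape (N, 1): one 64-bit integer per group
    # Broadcasting (N, 1) against the 4 shifts gives the 4 columns directly
    q = (v >> _SHIFTS10) & np.uint64(0x3FF)
    return q.astype(np.uint16).reshape(st.shape[0], -1)
```

As expected, a W×H 10-bit image becomes a 1.25W×H uint8 array; four values of 1023 (all 40 bits set) pack into five bytes of 255.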