Tags: python, numpy, tensorflow, scikit-learn, data-processing

np.load fails with ValueError: cannot reshape array of size 838715 into shape (838710,)


I'm trying to save the scaling parameters of a dataset to a .npy file on disk, so that I don't have to recalculate them every time I re-run the code.

For now I'm using MaxAbsScaler() from sklearn, and I save the scaler's max_abs_ property to one .npy file and the list of files that remain to be processed to another, so I can resume the run from the last saved state. The files I'm processing each contain a series of FFT amplitude features that I want to scale into [-1, 1], and I have to use partial_fit() because their total size exceeds my RAM. Note that I don't call .reshape anywhere in my code.
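For context, resuming from the saved state would look roughly like the sketch below. Rebuilding a fitted MaxAbsScaler by assigning max_abs_, scale_ and n_samples_seen_ by hand leans on scikit-learn internals, so treat it as illustrative only; persisting the whole scaler with joblib.dump would be the more robust route:

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Illustrative sketch: restore a previously fitted MaxAbsScaler from the
# saved max_abs_ array. Assigning fitted attributes by hand relies on
# scikit-learn internals; joblib.dump(scaler, 'scaler.joblib') is safer.
scaler = MaxAbsScaler()
scaler.max_abs_ = np.load('max_abs.npy', allow_pickle=True)
scaler.scale_ = scaler.max_abs_.copy()  # sklearn additionally maps zeros to 1 here
scaler.n_samples_seen_ = 1              # presence of this attribute marks the scaler as fitted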

The problem is: sometimes (not always) when I run the code that computes the scaler, I get an error like the one in the title.

I tried printing the length of the remaining file list before saving it to disk, to see what was happening; however, there was no discrepancy, even though I expected one.

I then tried converting my list into a NumPy array, but I kept hitting the same error.

I can't share much of the actual logic, since it's private, but long story short it's something like:

remaining_paths = list(np.load('remaining_files.npy', allow_pickle=True))
list_of_paths = list(remaining_paths)  # iterate over a copy so remove() is safe
for path in list_of_paths:
    data = np.load(path, allow_pickle=True)
    scaler.partial_fit(data)
    remaining_paths.remove(path)
    np.save('max_abs.npy', scaler.max_abs_, allow_pickle=True)
    np.save('remaining_files.npy', remaining_paths, allow_pickle=True)

Solution

  • I had the same issue myself, and np.load is sometimes funny when it comes to the clarity of its errors. In my case, it was because the file got corrupted: the process was interrupted while it was still writing to the file.

    The short answer is:

    Keep a backup: use one input file and one output file. After the output file has been fully written, overwrite the input file with its contents - maybe every 5 steps or so.

    import shutil
    import numpy as np

    remaining_paths = list(np.load('remaining_files_in.npy', allow_pickle=True))
    list_of_paths = list(remaining_paths)  # iterate over a copy so remove() is safe
    ctr = 0
    steps_to_save = 5
    for path in list_of_paths:
        data = np.load(path, allow_pickle=True)
        scaler.partial_fit(data)
        remaining_paths.remove(path)
        np.save('max_abs.npy', scaler.max_abs_, allow_pickle=True)
        np.save('remaining_files_out.npy', remaining_paths, allow_pickle=True)
        if ctr % steps_to_save == 0:
            # only overwrite the backup once the output file is fully written
            shutil.copyfile('remaining_files_out.npy', 'remaining_files_in.npy')
        ctr += 1
    

    You can refer to the shutil documentation or to this answer for the different types of file copy/move/overwrite options.
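    As an alternative to copying a backup every few steps, you can make each save atomic: write to a temporary file first, then rename it over the real one, so a crash can never leave a half-written .npy under the real name. A minimal sketch (the helper name atomic_save is made up here):

    import os
    import numpy as np

    def atomic_save(filename, arr):
        # np.save leaves names that already end in .npy unchanged
        tmp = filename + '.tmp.npy'
        np.save(tmp, arr, allow_pickle=True)
        # os.replace atomically swaps the temp file in, so an interrupted
        # run leaves the previous version of the file intact
        os.replace(tmp, filename)

    atomic_save('remaining_files.npy', remaining_paths)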

    The long answer is:

    If you dig deeper into the code of np.load, you'll see that it tries to reshape the data it reads to the shape recorded in the header dictionary of the .npy file - the error means the two no longer agree.
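    You can check for that mismatch yourself with numpy's own header helpers, comparing the element count the header promises against what the file actually holds. A rough diagnostic, assuming a version 1.0 header and a plain fixed-size dtype (pickled object arrays store a pickle stream instead, so the byte arithmetic below doesn't apply to them); inspect_npy is a made-up name:

    import numpy as np

    def inspect_npy(filename):
        with open(filename, 'rb') as fp:
            np.lib.format.read_magic(fp)  # consume the magic prefix, e.g. version (1, 0)
            shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(fp)
            promised = int(np.prod(shape))             # elements the header claims
            actual = len(fp.read()) // dtype.itemsize  # elements actually stored
        print(f'{filename}: header says {shape} ({promised}), data holds {actual}')
        return promised == actual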

    If you want a hotfix - this may or may not work, depending on where the file writer stopped: open the file in a hex editor and change the shape value stored in the header to the other value from your error. In my case, it was 839915 to 839910 (see the hex editor screenshot).

    Again, this only works if you're lucky enough that the rest of the file was written correctly and only the shape is wrong, and I would not advise it anyway - who knows in what other ways the data can get corrupted and still be readable?
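    If you'd rather script the patch than hand-edit bytes, the same fix can be done from Python - with exactly the same caveats. The two byte strings below are the values from the error in the title and are placeholders for your own; they must have the same number of digits so the padded header keeps its length:

    # Rewrite the shape in the .npy header so it matches the element count
    # numpy actually found. 'broken.npy' is a placeholder filename.
    OLD, NEW = b'(838710,)', b'(838715,)'

    with open('broken.npy', 'r+b') as fp:
        header = bytearray(fp.read(128))  # v1.0 headers are padded to a multiple of 64 bytes
        if OLD in header:
            fp.seek(0)
            fp.write(header.replace(OLD, NEW))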

    Please refer to the .npy format documentation for more details on how .npy files are structured: https://numpy.org/devdocs/reference/generated/numpy.lib.format.html