python pandas numpy drop-duplicates

Pandas dataframe: drop_duplicates after converting to str compares truncated strings, not actual contents


I tried the suggestion in this answer, and it appears that the conversion to string before dropping duplicates results in the truncated representation being compared. It seems to me that the dataframe.astype(str) already has this truncation. How do I stop this from happening?

What I have tried:

The following code

import pandas as pd
import numpy as np

A = np.zeros((100,100))
B = np.zeros((100,100))
A[50][50] = 1
B[50][51] = 1

data = [[A, 0], [B, 0]]
dataframe = pd.DataFrame(data)
dataframe2 = dataframe.astype(str).drop_duplicates(keep=False)

results in dataframe2 being an empty dataframe, whereas this

import pandas as pd
import numpy as np

C = np.zeros((2,2))
D = np.zeros((2,2))
C[0][0] = 1
D[1][1] = 1
data = [[C, 0], [D, 0]]
dataframe = pd.DataFrame(data)
dataframe2 = dataframe.astype(str).drop_duplicates(keep=False)

leaves dataframe2 with both rows, just like dataframe. I would expect the same result in the first case too.

I also tried adding pd.set_option('display.max_colwidth', None), but that didn't help.


Solution

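  • str(np.array) can be lossy. By default NumPy summarizes any array with more than 1000 elements and prints only the first and last three items along each axis, with "..." in between. A and B each have 100 x 100 = 10,000 elements, so their printed forms leave out the one element that differs: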

    >>> print(A)
    [[0. 0. 0. ... 0. 0. 0.]
     [0. 0. 0. ... 0. 0. 0.]
     [0. 0. 0. ... 0. 0. 0.]
     ...
     [0. 0. 0. ... 0. 0. 0.]
     [0. 0. 0. ... 0. 0. 0.]
     [0. 0. 0. ... 0. 0. 0.]]
    >>> print(B)
    [[0. 0. 0. ... 0. 0. 0.]
     [0. 0. 0. ... 0. 0. 0.]
     [0. 0. 0. ... 0. 0. 0.]
     ...
     [0. 0. 0. ... 0. 0. 0.]
     [0. 0. 0. ... 0. 0. 0.]
     [0. 0. 0. ... 0. 0. 0.]]
    

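    They're the same! The 2x2 arrays in your second example are small enough to be printed in full, which is why their strings differ and both rows survive.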

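    pd.set_option('display.max_colwidth', None) doesn't help because it only controls how pandas displays a column; the truncation comes from NumPy's own print options. Have a look at "How do I print the full NumPy array, without truncation?"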

    Alternatively, there may be more efficient ways to get an array into hashable form. For example, have a look at "Most efficient property to hash for numpy array" or "Fast way to Hash Numpy objects for Caching".
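
As a rough illustration of the hashable-form idea, here is a minimal sketch that skips the string conversion entirely. It assumes each array cell can be stood in for by its shape, dtype and raw bytes, and that all other cells are already hashable. The names to_hashable and row_keys are made up for this example and are not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

A = np.zeros((100, 100))
B = np.zeros((100, 100))
A[50][50] = 1
B[50][51] = 1

dataframe = pd.DataFrame([[A, 0], [B, 0]])


def to_hashable(value):
    """Hypothetical helper: swap an ndarray for a hashable (shape, dtype, bytes) tuple."""
    if isinstance(value, np.ndarray):
        # tobytes() copies the full buffer, so nothing is abbreviated the way str() is.
        return (value.shape, value.dtype.str, value.tobytes())
    return value  # assumed to be hashable already (ints, strings, ...)


# One hashable key per row, built from every cell in that row.
row_keys = pd.Series(
    [
        tuple(to_hashable(v) for v in row)
        for row in dataframe.itertuples(index=False, name=None)
    ],
    index=dataframe.index,
)

# Mirror drop_duplicates(keep=False): keep only rows whose key occurs exactly once.
dataframe2 = dataframe[~row_keys.duplicated(keep=False)]
```

Because the keys are built from the full array contents, A and B get different keys and both rows are kept. The filtered frame also still holds the original arrays rather than their string forms. For very large arrays the byte copies cost memory, so hashing the bytes instead may be worth considering, at the price of a small collision risk.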