I tried the suggestion in this answer, and it appears that converting to string before dropping duplicates causes the truncated representations to be compared. It seems to me that dataframe.astype(str) already produces this truncation. How do I stop this from happening?
What I have tried:
The following code
import pandas as pd
import numpy as np
A = np.zeros((100,100))
B = np.zeros((100,100))
A[50][50] = 1
B[50][51] = 1
data = [[A, 0], [B, 0]]
dataframe = pd.DataFrame(data)
dataframe2 = dataframe.astype(str).drop_duplicates(keep=False)
results in dataframe2 being an empty dataframe, whereas this
import pandas as pd
import numpy as np
C = np.zeros((2,2))
D = np.zeros((2,2))
C[0][0] = 1
D[1][1] = 1
data = [[C, 0], [D, 0]]
dataframe = pd.DataFrame(data)
dataframe2 = dataframe.astype(str).drop_duplicates(keep=False)
gives dataframe2 being the same as dataframe. I would expect this to be the result in the first case too.
I also tried adding pd.set_option('display.max_colwidth', None), but that didn't help.
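str() of a NumPy array can be lossy: once an array has more elements than NumPy's print threshold (1000 by default), its string form is summarized with "..." in place of the middle rows and columns.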
>>> print(A)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
>>> print(B)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
They're the same!
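To check this without reading the printed output, here is a minimal sketch that compares the string forms directly. It rebuilds A and B from the question; the variable names same_text, same_values and threshold are only illustrative.

```python
import numpy as np

# Rebuild the two arrays from the question: they differ in exactly one cell each.
A = np.zeros((100, 100))
B = np.zeros((100, 100))
A[50, 50] = 1
B[50, 51] = 1

# The summarized string forms hide the differing cells, so they compare equal.
same_text = str(A) == str(B)

# The actual values are different.
same_values = np.array_equal(A, B)

# Arrays with more elements than this threshold are summarized with "...".
threshold = np.get_printoptions()["threshold"]

print(same_text, same_values, A.size, threshold)
```

The 2x2 arrays C and D stay below the threshold, which would explain why the second example in the question behaves as expected.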
Have a look at "How do I print the full NumPy array, without truncation?" for ways to turn the summarization off.
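One option along those lines, as a sketch: raise the print threshold temporarily with the np.printoptions context manager, so that str() emits every element. The names full_a and full_b are only illustrative.

```python
import sys
import numpy as np

A = np.zeros((100, 100))
B = np.zeros((100, 100))
A[50, 50] = 1
B[50, 51] = 1

# Inside this block the threshold is effectively unlimited, so str() does not
# summarize the arrays. The previous print options are restored on exit.
with np.printoptions(threshold=sys.maxsize):
    full_a = str(A)
    full_b = str(B)

print(full_a == full_b)
```

Assuming dataframe.astype(str) simply calls str() on each cell, wrapping that call in the same with block should make the comparison see the full text, though I have not verified this across pandas versions. Bear in mind that this builds a very long string for every array, so it can be slow for large data.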
Alternatively, there may be more efficient ways to get an array into a hashable form. For example, have a look at "Most efficient property to hash for numpy array" or "Fast way to Hash Numpy objects for Caching".
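As a rough sketch of the hashing route, each array can be replaced by a key built from its dtype, shape and a digest of its raw bytes, and duplicates are then detected on those keys instead of on strings. The helper array_key and the extra third row (a copy of A, added so that there is a real duplicate to remove) are assumptions of mine, not something from the question or the linked posts.

```python
import hashlib

import numpy as np
import pandas as pd


def array_key(value):
    """Hypothetical helper: return a hashable stand-in for one cell."""
    if isinstance(value, np.ndarray):
        # Hash the raw bytes; include dtype and shape so that arrays with the
        # same bytes but a different layout do not collide.
        raw = np.ascontiguousarray(value).tobytes()
        digest = hashlib.sha256(raw).hexdigest()
        return f"{value.dtype.str}:{value.shape}:{digest}"
    return value  # non-array cells are assumed to be hashable already


A = np.zeros((100, 100))
B = np.zeros((100, 100))
A[50, 50] = 1
B[50, 51] = 1

# Third row duplicates the first one (illustrative addition).
dataframe = pd.DataFrame([[A, 0], [B, 0], [A.copy(), 0]])

# Build a same-shaped frame of keys, then filter the original frame with it.
keys = dataframe.apply(lambda column: column.map(array_key))
dataframe2 = dataframe[~keys.duplicated(keep=False)]

print(list(dataframe2.index))
```

With keep=False, both copies of A are dropped and only the row holding B remains. The original arrays stay in dataframe2 untouched, since the keys are used only to build the mask.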