I have 10 dataset files ranging from 10 MB to 8 GB, and I am trying to read an 8 GB .txt dataset that I cannot parse because it is delimited by "¦" (broken pipe). Almost all the files larger than 200 MB have this same problem, while the smaller files use the "normal" pipe ( | ).
The code is:
p = 0.001  # fraction of lines to keep (0.1%)
df = pd.read_csv("protectedSearch_GPRS.txt", sep='¦', skiprows=lambda i: i > 0 and random.random() > p)

It produces this warning:

ParserWarning: Falling back to the 'python' engine because the separator encoded in utf-8 is > 1 char long, and the 'c' engine does not support such separators; you can avoid this warning by specifying engine='python'.
  after removing the cwd from sys.path.
So what exactly is this? Is it a bug? How do I handle this problem? I have been trying to solve it for 2 days and I really don't know what else to try.
Thank you very much, and sorry for any language errors.
Not a bug. The pandas C engine only supports separators that are a single byte long once encoded. The broken pipe is one character, but it takes two bytes in UTF-8, hence the warning that the separator is "> 1 char long".
For example:
len('¦'.encode('utf-8'))
Out[24]: 2
len(','.encode('utf-8'))
Out[25]: 1
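To make the distinction between characters and bytes explicit, here is a small sketch comparing the two counts for the broken pipe. The variable names are only illustrative, and the Latin-1 line is included purely for comparison (nothing in the question says which encoding the files actually use).

```python
sep = "¦"  # U+00A6 BROKEN BAR

n_chars = len(sep)                            # number of characters (code points)
n_utf8_bytes = len(sep.encode("utf-8"))       # number of bytes when encoded as UTF-8
n_latin1_bytes = len(sep.encode("latin-1"))   # number of bytes when encoded as Latin-1

print(n_chars, n_utf8_bytes, n_latin1_bytes)  # prints: 1 2 1
```

So the separator is a single character; it is the UTF-8 encoded form that is longer than one byte, which is what the warning is checking.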
If you want to suppress the warning, explicitly state which engine you wish to use:
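import random
import pandas as pd

p = 0.001  # fraction of lines to keep (0.1%)

df = pd.read_csv(
    "protectedSearch_GPRS.txt",
    sep='¦',
    # always keep the header (row 0), keep other rows with probability p
    skiprows=lambda i: i > 0 and random.random() > p,
    engine='python'
)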
Using engine='python' also enables multi-character and regex separators.
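Since some of your files use "|" and others use "¦", one option is to check the header line of each file and pick the separator and engine from that. The following is only a rough sketch: detect_separator and read_sample are hypothetical helper names, not pandas functions, and it assumes the files are UTF-8 encoded and that the delimiter appears in the first line. If the files are in another encoding, pass it via the encoding argument.

```python
import random


def detect_separator(path, candidates=("|", "¦"), encoding="utf-8"):
    """Return the candidate that occurs most often in the first line of the file."""
    with open(path, encoding=encoding) as fh:
        header = fh.readline()
    # On a tie (e.g. neither candidate found) the first candidate is returned.
    return max(candidates, key=header.count)


def read_sample(path, p=0.001, encoding="utf-8"):
    """Read roughly a fraction p of the rows, choosing the engine from the separator."""
    import pandas as pd  # imported here so detect_separator works without pandas

    sep = detect_separator(path, encoding=encoding)
    # The C engine is only used when the separator is one byte long in UTF-8;
    # otherwise request the Python engine explicitly to avoid the ParserWarning.
    engine = "c" if len(sep.encode("utf-8")) == 1 else "python"
    return pd.read_csv(
        path,
        sep=sep,
        engine=engine,
        encoding=encoding,
        # always keep the header (row 0), keep other rows with probability p
        skiprows=lambda i: i > 0 and random.random() > p,
    )
```

With this, read_sample("protectedSearch_GPRS.txt") would use the Python engine for the "¦" files and the faster C engine for the "|" files. Only the first line is read for detection, so the check stays cheap even on the 8 GB file.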