[SOLVED] How to use read_csv on a dataset with a broken pipe "¦" as the separator?

I have 10 dataset files ranging from 10 MB to 8 GB, and I cannot read an 8 GB txt file because it uses "¦" (broken pipe) as the separator. Almost all the files larger than 200 MB have the same problem, while the smallest files use the "normal" pipe ( | ).

The code is:

import random
import pandas as pd

p = 0.001  # fraction of lines to keep (0.1%)

df = pd.read_csv("protectedSearch_GPRS.txt", sep='¦',
                 skiprows=lambda i: i > 0 and random.random() > p)


     ParserWarning: Falling back to the 'python' engine because the separator 
     encoded in utf-8 is > 1 char long, and the 'c' engine does not support such 
     separators; you can avoid this warning by specifying engine='python'.
       after removing the cwd from sys.path.

So what exactly is this? Is it a bug? How do I handle it? I have been trying to solve this for 2 days and I really don't know what else to do.

Thank you very much, and sorry for any language errors.

Solution

Not a bug. The pandas C engine only supports a separator that is a single byte once encoded. The broken pipe (¦, U+00A6) is one character, but it takes two bytes in UTF-8, hence the warning that the encoded separator is > 1 char long.

For example:

len('¦'.encode('utf-8'))
Out[24]: 2
len(','.encode('utf-8'))
Out[25]: 1
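
To make the character-versus-byte distinction explicit, here is a small sketch using only the standard library. The variable names are just for illustration:

```python
import unicodedata

sep = "¦"

# As a Python string, the separator is a single character (one code point).
print(len(sep))                 # 1
print(unicodedata.name(sep))    # BROKEN BAR
print(f"U+{ord(sep):04X}")      # U+00A6

# Encoded as UTF-8 it becomes two bytes, which is what the warning refers to.
utf8_bytes = sep.encode("utf-8")
print(utf8_bytes, len(utf8_bytes))

# The ordinary pipe is ASCII, so it stays a single byte in UTF-8.
pipe_bytes = "|".encode("utf-8")
print(pipe_bytes, len(pipe_bytes))
```

So the "> 1 char long" in the warning message is about the encoded length, not the number of characters you typed.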

If you want to suppress the warning, explicitly state which engine you wish to use:

df = pd.read_csv(
      "protectedSearch_GPRS.txt", 
      sep='¦', 
      skiprows=lambda i: i > 0 and random.random() > p,
      engine='python'
)

Using engine='python' enables multi-character and regex separators. Keep in mind that the python engine is slower than the C engine, which can matter for an 8 GB file.
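
Since some of your files use "¦" and others use "|", one option is a regex separator that accepts either. This is only a sketch on made-up in-memory data (the sample strings and column names are my assumptions, not your real files), and it assumes neither character ever appears inside a field:

```python
import io

import pandas as pd

# Hypothetical sample data: one file style uses '¦', the other uses '|'.
broken_bar_text = "id¦name¦value\n1¦alpha¦10\n2¦beta¦20\n"
pipe_text = "id|name|value\n3|gamma|30\n"

# A character class matches either separator.
# Regex separators are only supported by the python engine.
SEP = r"[¦|]"

df_a = pd.read_csv(io.StringIO(broken_bar_text), sep=SEP, engine="python")
df_b = pd.read_csv(io.StringIO(pipe_text), sep=SEP, engine="python")

print(df_a)
print(df_b)
```

With your real files you would pass the file path instead of the StringIO object and keep the same sep and engine arguments.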