python pandas broken-pipe

How to use read_csv on a dataset with broken pipe "¦" as separator?


I have 10 dataset files ranging from 10 MB to 8 GB, and I am trying to read an 8 GB txt dataset that I cannot read because of the "¦" (broken pipe) separator. Almost all files larger than 200 MB have this same problem, while the smallest files use the "normal" pipe ( | ).

The code is:

import random

import pandas as pd

p = 0.001  # fraction of lines to keep (0.1%)

df = pd.read_csv("protectedSearch_GPRS.txt", sep='¦', 
                    skiprows= lambda i: i>0 and random.random() > p)


     ParserWarning: Falling back to the 'python' engine because the separator 
     encoded in utf-8 is > 1 char long, and the 'c' engine does not support such 
     separators; you can avoid this warning by specifying engine='python'.
       after removing the cwd from sys.path.

So what exactly is this? Is it a bug? How do I handle this problem? I have been trying to solve it for 2 days and I really don't know what else to do.

Thank you very much, and sorry for any language errors.


Solution

  • Not a bug. In practice, the Pandas C engine only supports separators that encode to a single byte. The broken pipe "¦" (U+00A6) is a single character, but it takes two bytes in UTF-8, which is what the warning means by "> 1 char long".

    For example:

    len('¦'.encode('utf-8'))
    Out[24]: 2
    len(','.encode('utf-8'))
    Out[25]: 1
    

    If you want to suppress the warning, explicitly state which engine you wish to use:

    df = pd.read_csv(
          "protectedSearch_GPRS.txt", 
          sep='¦', 
          skiprows=lambda i: i > 0 and random.random() > p,
          engine='python'
    )
    

    Using engine='python' enables multi-character and regex separators.
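Since the question mentions that some files use "|" and others "¦", one possible approach is a regex separator that accepts either symbol, so the same call works for every file. The following is only a minimal sketch: the sample data, the name EITHER_PIPE and the helper read_pipe_file are made up for illustration and are not part of pandas. Note that regex separators tend to ignore quoting, so this assumes the fields contain no quoted separators.

```python
import io
import random

import pandas as pd

# Hypothetical sample data: the same table written with each separator.
broken_bar_text = "id¦name¦value\n1¦a¦10\n2¦b¦20\n3¦c¦30\n"
plain_pipe_text = broken_bar_text.replace("¦", "|")

# A separator longer than one character is treated as a regular expression
# by the python engine; this character class matches either symbol.
EITHER_PIPE = r"[|¦]"


def read_pipe_file(source, p=1.0, seed=None):
    """Read a '|' or '¦' separated source, keeping each data row with probability p."""
    rng = random.Random(seed)
    return pd.read_csv(
        source,
        sep=EITHER_PIPE,
        engine="python",
        # Row 0 is the header and is always kept.
        skiprows=lambda i: i > 0 and rng.random() > p,
    )


df_broken = read_pipe_file(io.StringIO(broken_bar_text))
df_plain = read_pipe_file(io.StringIO(plain_pipe_text))
print(df_broken.equals(df_plain))
```

For a real file, pass the path instead of the StringIO object and a smaller p (such as the 0.001 from the question). If the file is not UTF-8, an encoding argument may also be needed; in Latin-1, for example, "¦" is the single byte 0xA6.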