pythonpandasdataframetext-files

Pandas/Python read file with different separators


I have a .txt file as follows:

columnA;columnB;columnC;columnD
2022040200000000000000000000011    8000702   79005889  SPECIAL_AGENCY

You can observe that the names of the columns are separated by a semi column ;, however, row values, have different separators. In this example, columnA has 3 spaces, columnB has 3, columnC has 2, and columnD has 7.

It is important to clarify, that I need to keep the spaces, hence the “real” separator is the last space.

Considering I have a schema, that tells me for each column what is the amount of spaces (separators?) I have, how can I turn it into a pandas dataframe?


Solution

  • One way is to use a double regex separator with (|) and pandas.read_csv :

    df = pd.read_csv("/tmp/file.txt", sep=";|(?<=\d)\s+(?=\B)", engine="python")
    

    Output :

    print(df)
    ​
                               columnA  columnB   columnC                       columnD
    0  2022040200000000000000000000011  8000702  79005889   SPECIAL_AGENCY       LONDON
    

    NB: If needed, you can add pandas.Series.replace to clean up the extra (\s) in the columnD.