pythonpandasfilecheminformatics

How to create a Pandas df from a haphazard .dat file?


I have a .dat file that looks like this.

6.74E+01  "methane"                                        "74-82-8"     "L"
5.06E+01  "ethane"                                         "74-84-0"     "L"
7.16E+01  "propane"                                        "74-98-6"     "L"
9.59E+01  "butane"                                         "106-97-8"    "L"
1.20E+02  "2-methylpropane"                                "75-28-5"     "L"
3.73E+02  "dimethylpropane"                                "463-82-1"    "L"
1.25E+02  "pentane"                                        "109-66-0"    "L"

This .dat file appears to be haphazardly created. As far as I can tell, the columns are separated by varying numbers of spaces. Further down the file, some rows also have one extra column for comments. I need to read this into a Pandas dataframe. I have tried...

raw = pd.read_table(r'FILE PATH')
raw.columns = ['Value', 'Name', 'Numbers', 'Letter']

Which then throws an error saying "Exception has occurred: ValueError Length mismatch: Expected axis has 1 elements, new values have 4 elements"

I was expecting an error, but this makes it look like there is only 1 column. I am totally at a loss and I hope someone can help. Thanks

Edit: The extra columns have a single space of separation.

1.01E-02  "2,3-benzindene"                                 "86-73-7"     "M" ! fluorene

Solution

  • Assuming that columns are defined by runs of whitespace, you can use the delim_whitespace=True argument of read_table.

    I assume that the file does not contain a header line. By specifying the column names through the names argument, you avoid a) that the first line is interpreted as a header line and b) that the parser is confused by the "extra columns".

    raw = pd.read_table(filename, delim_whitespace=True,
                        names=['Value', 'Name', 'Numbers', 'Letter'])
    

    Result of print(raw):

          Value             Name   Numbers Letter
    0   67.4000          methane   74-82-8      L
    1   50.6000           ethane   74-84-0      L
    2   71.6000          propane   74-98-6      L
    3   95.9000           butane  106-97-8      L
    4  120.0000  2-methylpropane   75-28-5      L
    5  373.0000  dimethylpropane  463-82-1      L
    6  125.0000          pentane  109-66-0      L
    7    0.0101   2,3-benzindene   86-73-7      M