pythonpandasread-fwf

Python pandas read_fwf strips white space


I am facing an issue using the read_fwf command from the Python library pandas, same as described in this unresolved question

I want to read an ascii file containing results of numeric computations. The file contains repetitive header blocks with a variable number of lines between each of them and with a roughly described format but rather arbitrary string content. I do not want to allow any loss of information and therefore need to be able to process everything as it was originally.

Look at this examplary line:

$SUBTITLE= 48-ELEMENT CANTILEVER BEAM ON THE X-AXIS 1181

Using a

with pd.read_fwf(self.file_path, header=None,
                         chunksize=chunk_size,  # no of lines read
                         colspecs=[(0, 23), (23, 41), (41, 59), (59, 72)],
                         engine='c', dtype={0: str}) as df_gen:

context manager to read the file, white spaces between columns will be stripped despite the seamlessly neighbouring column lengths. So it does not act as a fixed-width reader.

Cutting out the middle part of the line above with

"".join(s.dropna()).split("=")[-1].strip(" ")

with s being the row of the DataFrame containing that line, I will get a string

48-ELEMENT CANTILEVER BEAM ONTHE X-AXIS instead of

48-ELEMENT CANTILEVER BEAM ON THE X-AXIS.

Does anybody know an elegant solution to this issue? I am working on an approach where I will initially not cut into columns and do it for the numeric blocks after i extracted the header lines but it might increase computation time significantly.

Is there maybe a keyword in newer pandas versions that will disable the stripping?


Solution

  • Take bigger cuts:

    s = "$SUBTITLE= 48-ELEMENT CANTILEVER BEAM ON THE X-AXIS                         1181"
    
    pd.read_fwf(StringIO(s), header=None, colspecs=[(0, 11),(11, 51),(51,80)]
    

    Output:

                0                                         1     2
    0  $SUBTITLE=  48-ELEMENT CANTILEVER BEAM ON THE X-AXIS  1181
    

    Or you can just read in line by line:

    pd.read_table(StringIO(s), header=None)
    
    ...
                                                                                      0
    0  $SUBTITLE= 48-ELEMENT CANTILEVER BEAM ON THE X-AXIS                         1181
    

    A little regex magic and you can consistently extract the inner line:

    re.findall('.*= (.+?)\s\s', s)
    
    ...
    
    ['48-ELEMENT CANTILEVER BEAM ON THE X-AXIS']