python, python-3.x, text, text-parsing, memory-profiling

Parsing tables from .txt (text) files


I have some profiling results from a Python memory profiler, shown below:

Filename: main.py

Line #    Mem usage    Increment   Line Contents
================================================
    30    121.8 MiB    121.8 MiB   @profile(stream=f)
    31                             def parse_data(data):
    32    121.8 MiB      0.0 MiB       Y=data["price"].values
    33    121.8 MiB      0.0 MiB       Y=np.log(Y)
    34    121.8 MiB      0.0 MiB       features=data.columns
    35    121.8 MiB      0.0 MiB       X1=list(set(features)-set(["price"]))
    36    126.3 MiB      4.5 MiB       X=data[X1].values
    37    126.3 MiB      0.0 MiB       ss=StandardScaler()
    38    124.6 MiB      0.0 MiB       X=ss.fit_transform(X)
    39    124.6 MiB      0.0 MiB       return X,Y


Filename: main.py

Line #    Mem usage    Increment   Line Contents
================================================
    41    127.1 MiB    127.1 MiB   @profile(stream=f)
    42                             def linearRegressionfit(Xt,Yt,Xts,Yts):
    43    127.1 MiB      0.0 MiB       lr=LinearRegression()
    44    131.2 MiB      4.1 MiB       model=lr.fit(Xt,Yt)
    45    132.0 MiB      0.8 MiB       predict=lr.predict(Xts)
    46                             

Now, I need to get these results into a form I can use for plotting and other purposes, but the raw text is not very handy to work with. The table shows line-by-line profiling results. How can I get a pandas DataFrame, or some other tabular structure, from which I can select any row or column of this table?

P.S. I have looked at regex and parsimonious, but I can't seem to get them to work in my case.


Solution

  • It is just a bit of a parsing exercise. With the standard split() and some minor adjustments, you can get a pretty clean data frame in a few lines of code.

    txt = '''
    Filename: main.py
    
    Line #    Mem usage    Increment   Line Contents
    ================================================
        30    121.8 MiB    121.8 MiB   @profile(stream=f)
        31                             def parse_data(data):
        32    121.8 MiB      0.0 MiB       Y=data["price"].values
        33    121.8 MiB      0.0 MiB       Y=np.log(Y)
        34    121.8 MiB      0.0 MiB       features=data.columns
        35    121.8 MiB      0.0 MiB       X1=list(set(features)-set(["price"]))
        36    126.3 MiB      4.5 MiB       X=data[X1].values
        37    126.3 MiB      0.0 MiB       ss=StandardScaler()
        38    124.6 MiB      0.0 MiB       X=ss.fit_transform(X)
        39    124.6 MiB      0.0 MiB       return X,Y
    
    
    Filename: main.py
    
    Line #    Mem usage    Increment   Line Contents
    ================================================
        41    127.1 MiB    127.1 MiB   @profile(stream=f)
        42                             def linearRegressionfit(Xt,Yt,Xts,Yts):
        43    127.1 MiB      0.0 MiB       lr=LinearRegression()
        44    131.2 MiB      4.1 MiB       model=lr.fit(Xt,Yt)
        45    132.0 MiB      0.8 MiB       predict=lr.predict(Xts)
    '''
    
    import pandas as pd
    
    lines = []
    for line in txt.split('\n'):
        # skip the header, separator and blank lines of each profiled block
        if line.startswith('Filename'): continue
        if line.startswith('Line'): continue
        if line.startswith('='): continue
        if not line.strip(): continue
        data = line.split()
        # def lines have no memory columns, so pad them out
        if data[1] == 'def':
            data = [data[0], '', '', '', '', ' '.join(data[1:])]
    
        # re-join the value/unit pairs and keep the full line contents
        data = [data[0], ' '.join(data[1:3]), ' '.join(data[3:5]), ' '.join(data[5:])]
        lines.append(data)
    
    df = pd.DataFrame(lines, columns=['Line #', 'Mem usage', 'Increment', 'Line Contents'])
    
    print(df)
    
       Line #  Mem usage  Increment                            Line Contents
    0      30  121.8 MiB  121.8 MiB                       @profile(stream=f)
    1      31                                          def parse_data(data):
    2      32  121.8 MiB    0.0 MiB                   Y=data["price"].values
    3      33  121.8 MiB    0.0 MiB                              Y=np.log(Y)
    4      34  121.8 MiB    0.0 MiB                    features=data.columns
    5      35  121.8 MiB    0.0 MiB    X1=list(set(features)-set(["price"]))
    6      36  126.3 MiB    4.5 MiB                        X=data[X1].values
    7      37  126.3 MiB    0.0 MiB                      ss=StandardScaler()
    8      38  124.6 MiB    0.0 MiB                    X=ss.fit_transform(X)
    9      39  124.6 MiB    0.0 MiB                               return X,Y
    10     41  127.1 MiB  127.1 MiB                       @profile(stream=f)
    11     42                        def linearRegressionfit(Xt,Yt,Xts,Yts):
    12     43  127.1 MiB    0.0 MiB                    lr=LinearRegression()
    13     44  131.2 MiB    4.1 MiB                      model=lr.fit(Xt,Yt)
    14     45  132.0 MiB    0.8 MiB                  predict=lr.predict(Xts)
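
    Since you mentioned plotting, note that every column is still a string at
    this point. A small follow-up step (just a sketch, assuming you want the
    memory values as plain floats in MiB and the blank def rows as NaN)
    converts the numeric columns:

    df['Line #'] = pd.to_numeric(df['Line #'])
    for col in ['Mem usage', 'Increment']:
        # strip the ' MiB' unit and turn the blank def rows into NaN
        df[col] = pd.to_numeric(df[col].str.replace(' MiB', '', regex=False),
                                errors='coerce')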
    

    You can then split the data frame wherever 'Line Contents' starts with '@profile'.

    For example:

    split_idx = df[df['Line Contents'].str.startswith('@profile')].index
    dataframes = []
    for i, idx in enumerate(split_idx):
        try:
            # slice from this @profile row up to (but not including) the next one
            dataframes.append(df.iloc[idx:split_idx[i + 1]])
        except IndexError:
            # last block: slice to the end of the frame
            dataframes.append(df.iloc[idx:])
    
    
    print(dataframes[0])
    print('======')
    print(dataframes[1])
    
       Line #  Mem usage  Increment                            Line Contents
    0      30  121.8 MiB  121.8 MiB                       @profile(stream=f)
    1      31                                          def parse_data(data):
    2      32  121.8 MiB    0.0 MiB                   Y=data["price"].values
    3      33  121.8 MiB    0.0 MiB                              Y=np.log(Y)
    4      34  121.8 MiB    0.0 MiB                    features=data.columns
    5      35  121.8 MiB    0.0 MiB    X1=list(set(features)-set(["price"]))
    6      36  126.3 MiB    4.5 MiB                        X=data[X1].values
    7      37  126.3 MiB    0.0 MiB                      ss=StandardScaler()
    8      38  124.6 MiB    0.0 MiB                    X=ss.fit_transform(X)
    9      39  124.6 MiB    0.0 MiB                               return X,Y
    ======
       Line #  Mem usage  Increment                            Line Contents
    10     41  127.1 MiB  127.1 MiB                       @profile(stream=f)
    11     42                        def linearRegressionfit(Xt,Yt,Xts,Yts):
    12     43  127.1 MiB    0.0 MiB                    lr=LinearRegression()
    13     44  131.2 MiB    4.1 MiB                      model=lr.fit(Xt,Yt)
    14     45  132.0 MiB    0.8 MiB                  predict=lr.predict(Xts)
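
    If you would rather use a regular expression (as mentioned in the P.S.), a
    single pattern can also do the whole parse. This is just a sketch written
    against the memory_profiler output shown above; it captures the rest of the
    line in one group, so multi-word contents such as return X,Y are kept in a
    single field:

    import re
    import pandas as pd

    row_re = re.compile(
        r'^\s*(?P<line>\d+)\s+'                                   # source line number
        r'(?:(?P<mem>\d+\.\d+ MiB)\s+(?P<inc>\d+\.\d+ MiB)\s+)?'  # optional memory columns
        r'(?P<contents>\S.*)$'                                    # the code on that line
    )

    rows = [m.groupdict() for m in map(row_re.match, txt.splitlines()) if m]
    df = pd.DataFrame(rows).rename(columns={'line': 'Line #', 'mem': 'Mem usage',
                                            'inc': 'Increment', 'contents': 'Line Contents'})

    The Filename, header and separator lines never match the pattern, so they
    are skipped automatically.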