python pandas ete3

Get a table from a print output (pandas)


I ran a programme called codeml, as implemented in the Python package ete3.

Here is the printed output of the model generated by codeml:

>>> print(model)
 Evolutionary Model fb.cluster_03502:
        log likelihood       : -35570.938479
        number of parameters : 23
        sites inference      : None
        sites classes        : None
        branches             : 
        mark: #0  , omega: None      , node_ids: 8   , name: ROOT
        mark: #1  , omega: 789.5325  , node_ids: 9   , name: EDGE
        mark: #2  , omega: 0.005     , node_ids: 4   , name: Sp1
        mark: #3  , omega: 0.0109    , node_ids: 6   , name: Seq1
        mark: #4  , omega: 0.0064    , node_ids: 5   , name: Sp2
        mark: #5  , omega: 865.5116  , node_ids: 10  , name: EDGE
        mark: #6  , omega: 0.005     , node_ids: 7   , name: Seq2
        mark: #7  , omega: 0.0038    , node_ids: 11  , name: EDGE
        mark: #8  , omega: 0.067     , node_ids: 2   , name: Sp3
        mark: #9  , omega: 999.0     , node_ids: 12  , name: EDGE
        mark: #10 , omega: 0.1165    , node_ids: 3   , name: Sp4
        mark: #11 , omega: 0.1178    , node_ids: 1   , name: Sp5

But since this is only printed output, I need to get this information into a table such as:

Omega       node_ids       name 
None        8              ROOT
789.5325    9              EDGE
0.005       4              Sp1
0.0109      6              Seq1
0.0064      5              Sp2
865.5116    10             EDGE
0.005       7              Seq2
0.0038      11             EDGE
0.067       2              Sp3
999.0       12             EDGE
0.1165      3              Sp4
0.1178      1              Sp5

I need it in this form so that I can process the values further.

Do you have any idea how to turn a printed output into a table?

Thanks for your help.


Solution

  • I took a look at the underlying code in model.py.

    It seems that you can use s = str(model) (equivalent to s = model.__str__()) to obtain this print-out as a string. From there you can parse the string using standard string operations. I don't know the exact form of your string, so you may need to adjust the number of header lines to skip, but your code could look something like this:

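    import pandas as pd
    
    # s is the string obtained above; split it into individual lines.
    lines = s.split('\n')
    
    lst = []
    first_idx = 6  # Index of the first "mark: ..." line; adjust if your header differs.
    # Column names are the text before each ':' (mark, omega, node_ids, name).
    names = [field[:field.index(':')].strip() for field in lines[first_idx].split(',')]
    
    for line in lines[first_idx:]:
        if line.strip():  # Skip empty or whitespace-only lines.
            # Values are the text after each ':'; drop the '#' prefix of the mark.
            row = [field[field.index(':')+1:].strip().strip("#") for field in line.split(',')]
            lst.append(row)
    
    df = pd.DataFrame(lst, columns=names)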
    

    There are prettier ways to do this, but it gets the job done.
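
One possible alternative, as a rough sketch: match the branch lines with a regular expression instead of counting header lines, and convert the values to numbers along the way. It assumes each branch line looks exactly like the ones printed in the question (mark, omega, node_ids, name, in that order, with non-empty names). SAMPLE, BRANCH_RE and parse_branches are illustrative names, not part of ete3.

```python
import re

# Sample text copied from the question; in practice use text = str(model).
SAMPLE = """ Evolutionary Model fb.cluster_03502:
        log likelihood       : -35570.938479
        number of parameters : 23
        sites inference      : None
        sites classes        : None
        branches             : 
        mark: #0  , omega: None      , node_ids: 8   , name: ROOT
        mark: #1  , omega: 789.5325  , node_ids: 9   , name: EDGE
        mark: #2  , omega: 0.005     , node_ids: 4   , name: Sp1
"""

# Pattern for one branch line (assumed format, taken from the print-out above).
BRANCH_RE = re.compile(
    r"mark:\s*#(?P<mark>\d+)\s*,\s*omega:\s*(?P<omega>\S+)\s*,"
    r"\s*node_ids:\s*(?P<node_ids>\d+)\s*,\s*name:\s*(?P<name>\S+)"
)


def parse_branches(text):
    """Return one dict per branch line; all other lines are ignored."""
    rows = []
    for line in text.splitlines():
        match = BRANCH_RE.search(line)
        if match is None:
            continue  # Header line or blank line.
        omega = match["omega"]
        rows.append({
            # The root has no omega; keep it as None instead of a float.
            "omega": None if omega == "None" else float(omega),
            "node_ids": int(match["node_ids"]),
            "name": match["name"],
        })
    return rows


rows = parse_branches(SAMPLE)
```

Passing the result to pd.DataFrame(rows) should then give a table with the columns omega, node_ids and name, with numeric values already converted.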