pythonpython-3.xpandasdataframeread-write

Read text files with multiple lists with spacings and commas exist between elements in the lists into pandas dataframe


I have a text file called tropical.txt that have multiple lists and each list is separated by a new line. Notice the comma is surrounded by spaces.

 space here and space here
         | |
['papaya' , 'mangosteen' , 'banana']
[]
['coconut' , 'mango']
['mangosteen' , 'papaya']

I tried the following code

import pandas as pd

df = pd.read_csv('tropical.txt', sep='\n', header=None, engine = 'python')
df

which gives me

ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.

If I were to just do

import pandas as pd

df = pd.read_csv('tropical.txt', header= None, engine = 'python')
df

The output isn't what I wanted

         0           1             2
0   ['papaya'   'mangosteen'    'banana']
1   []               None        None
2   ['coconut'      'mango']     None
3   ['mangosteen'   'papaya']    None


I am expecting


                        0   
0   [papaya,mangosteen,banana]
1   []  
2   [coconut,mango] 
3   [mangosteen,papaya]


Any suggestion?


Solution

  • You can use read_csv, by specifying a separator which will not occur in the lines (e.g. \0) (so that each line will be read as a whole) and ast.literal_eval as a converter for the values:

    import ast
    
    pd.read_csv('tropical.txt', header=None, sep='\0', names=['fruits'], converters={ 'fruits' : ast.literal_eval })
    

    Output:

                             fruits
    0  [papaya, mangosteen, banana]
    1                            []
    2              [coconut, mango]
    3          [mangosteen, papaya]