pythonpandasmarkdownpdf-parsing

Unable to convert text into dataframe in python


I am trying to convert a text into a dataframe using Python.

sample_text: 'This is \nsample text\n\nName|age\n--|--\n1.abc|45\n2.xyz|34'

Final Desired output:

enter image description here

Steps I am following to achieve the above output are listed below :

  1. Break the text into multiple rows and assign it to a variable: I have tried using print() to process this text formatted_text = print('This is \nsample text\n\nName|age\n--|--\n1.abc|45\n2.xyz|34') but it cant be assigned as print() returns NoneType, so I get an error here.

Desired output after this step:

This is 
sample text

Name|age
--|--
1.abc|45
2.xyz|34
  1. Use the above line break text stored in a variable to be read as a CSV with the separator | to create a dataframe: I have been thinking of processing this as pd.read_csv(formatted_text,sep='|', skipinitialspace=True)

Desired_output after this step:

enter image description here

I tried earlier explaining this problem in SO post but I guess I wasn't able to explain it well and it got closed. I Hope I am able to explain my issue this time. It could be a silly task but I have been stuck at this for a long time now and would appreciate any help.


Solution

  • A possible solution:

    text = 'This is \nsample text\n\nName|age\n--|--\n1.abc|45\n2.xyz|34'
    
    pd.read_csv(StringIO(text), lineterminator='\n', engine='c', header=None)
    

    Output:

                 0
    0     This is 
    1  sample text
    2     Name|age
    3        --|--
    4     1.abc|45
    5     2.xyz|34
    

    To split the columns, we can use str.split after read_csv:

    (pd.read_csv(StringIO(text), lineterminator='\n', engine='c', header=None)[0]
     .str.split('|', expand=True))
    

    Output:

                 0     1
    0     This is   None
    1  sample text  None
    2         Name   age
    3           --    --
    4        1.abc    45
    5        2.xyz    34