pandasloaddata

How to load raw data from a text file into a pandas DataFrame?


My data is in a text file in the format shown below:

heading1:blah

heading2:blah

heading3:blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah (the heading3 text wraps onto a new line, for this row only)


heading1:blah

heading2:blah

heading3:blah blah blah blah blah blah blah blah blah blah

and so on...

Note:


Solution

  • Thanks for posting the link to the data. If it's publicly available, it's helpful to do that initially. I ran this on the full data set; that took a couple of seconds on a decent laptop.

    import pandas as pd
    
    with open('rfa_all.NL-SEPARATED.txt', 'r') as f:
        data = f.readlines()
    
    # map each field name to an empty list.
    # the values must be lists because the loop below calls .append() on them.
    d = {'SRC': [], 'TGT': [], 'VOT': [],  'RES': [],  'YEA': [],  'DAT': [],  'TXT': []}
    
    for line in data: # go through the file line by line
        if line != '\n': # skip blank lines
            line = line.replace('\n', '') # drop the trailing newline
            key, val = line.split(':', 1) # split on the first ':' only, so colons inside the value are kept
            d[key].append(val)
    
    df = pd.DataFrame(d)
    df
    

    Extensive help from this post: https://stackoverflow.com/a/26644245/6672746

    I am sure there is a much faster way to set this up, but I think this will work.
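
The loop above assumes that every non-blank line has the form `KEY:value` and that every record contains every key. The question mentions values that wrap onto a new line, so here is a minimal sketch of a more tolerant variant. It is an illustration, not the method used in the answer: `parse_records`, the sample text and the `heading1`..`heading3` key names are hypothetical, and it assumes the field names are known up front and that a repeated key marks the start of the next record.

```python
import pandas as pd

def parse_records(lines, keys):
    """Turn 'KEY:value' lines into a list of dicts, one dict per record.

    Assumptions (hypothetical, not from the original answer):
    - a new record starts whenever a key repeats within the current record;
    - a line that does not start with a known key continues the previous value.
    """
    records, current, last_key = [], {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information here
        key, sep, val = line.partition(':')
        if sep and key in keys:
            if key in current:  # key seen again: the previous record is complete
                records.append(current)
                current = {}
            current[key] = val
            last_key = key
        elif last_key is not None:
            # wrapped text: append it to the field we were just reading
            current[last_key] += ' ' + line
    if current:
        records.append(current)
    return records

# Made-up sample: one wrapped value, and one record with a missing field
sample = """heading1:a

heading2:b

heading3:first part
second part: with a colon

heading1:c

heading3:d
""".splitlines()

keys = ['heading1', 'heading2', 'heading3']
records = parse_records(sample, keys)
df = pd.DataFrame(records, columns=keys)  # missing fields become NaN
print(df)
```

Building one dict per record, rather than one list per column, means a record with a missing field produces a NaN instead of misaligned columns or a length error in `pd.DataFrame`. The trade-off is that a record which shares no key with the one before it would be merged into it.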