My data is in a text file in the format shown below:
heading1:blah
heading2:blah
heading3:blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah blah (for this row only, the heading3 text continues onto a new line)
heading1:blah
heading2:blah
heading3:blah blah blah blah blah blah blah blah blah blah
and so on...
Note:
Thanks for posting the link to the data. If it's publicly available, it's helpful to do that initially. I ran this on the full data set; that took a couple of seconds on a decent laptop.
import numpy as np
import pandas as pd
with open('rfa_all.NL-SEPARATED.txt', 'r') as f:
    data = f.readlines()
# create a dictionary mapping each key to a list.
# if you don't set the values as lists, you get an error.
d = {'SRC': [], 'TGT': [], 'VOT': [], 'RES': [], 'YEA': [], 'DAT': [], 'TXT': []}
for line in data:  # go through file line by line
    if line != '\n':  # skip blank lines
        line = line.replace('\n', '')  # get rid of '\n' in all fields
        key, val = line.split(':', 1)  # split on the first ':' only, so colons inside the value are kept
        d[key].append(val)
df = pd.DataFrame(d)
df
Extensive help from this post: https://stackoverflow.com/a/26644245/6672746
I am sure there is a much faster way to set this up, but I think this will work.
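If a value can wrap onto a following line, as the question describes for heading3, the loop above would raise an error on a wrapped line that has no colon. Here is a rough sketch of one way to handle that, assuming any line that does not start with a known key is a continuation of the previous value. The function name `parse_records` and the sample lines are made up for illustration.

```python
def parse_records(lines, keys):
    """Collect 'key:value' lines into a dict of lists, one list per key.

    Assumption: a non-blank line that does not start with '<known key>:'
    is a continuation of the previous value and is appended to it.
    """
    columns = {key: [] for key in keys}
    last_key = None
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():
            continue  # skip blank lines
        # partition splits on the first ':' only
        key, sep, val = line.partition(':')
        if sep and key in columns:
            columns[key].append(val)
            last_key = key
        elif last_key is not None:
            # wrapped text: glue it onto the most recent value
            columns[last_key][-1] += ' ' + line.strip()
    return columns


# hypothetical sample in the question's format
sample = [
    "heading1:a\n",
    "heading2:b\n",
    "heading3:c d\n",
    "e f\n",            # wrapped continuation of heading3
    "\n",
    "heading1:g\n",
    "heading2:h\n",
    "heading3:i: j\n",  # colon inside the value is kept
]
columns = parse_records(sample, ['heading1', 'heading2', 'heading3'])
```

The resulting dict can be passed to `pd.DataFrame(columns)` the same way as `d` above, as long as every key ends up with the same number of values.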