The dataset below consists of sentences in which every word is labelled individually. I want to split it into two variables to train my model. Records are separated by an empty line, and each record spans multiple lines where the word and its label are comma-separated.
how,SW
is,SW
the,SW
weather,WTR
?,.
# blank line
will,SW
it,SW
rain,RAIN
this,ADJ
weekend,TIME
?,.
I want to process this input file to generate the expected output shown below:
The X variable must contain the words of every record as individual lists:
[[how, is, the, weather, ?], [will, it, rain, this, weekend, ?]]
The Y variable must contain the labels of every record as individual lists:
[[SW, SW, SW, WTR, .], [SW, SW, RAIN, ADJ, TIME, .]]
Please suggest. Thank you!
Probably something like this would work:
Xs = []
Ys = []
X = []
Y = []

with open('file.txt', 'r') as f:
    lines = f.readlines()

for line in lines:
    line = line.strip()
    if line == "":
        # An empty line marks the end of a record.
        Xs.append(X)
        Ys.append(Y)
        X, Y = [], []
    else:
        # Each data line is "word,label".
        x, y = line.split(",")
        X.append(x)
        Y.append(y)

# The last record has no trailing empty line, so append it here
# (the guard avoids appending empty lists if the file ends with a blank line).
if X:
    Xs.append(X)
    Ys.append(Y)

print(Xs)
print(Ys)
#[['how', 'is', 'the', 'weather', '?'], ['will', 'it', 'rain', 'this', 'weekend', '?']]
#[['SW', 'SW', 'SW', 'WTR', '.'], ['SW', 'SW', 'RAIN', 'ADJ', 'TIME', '.']]
The code opens the file, reads all the lines, and iterates through them, checking whether we have finished reading a record (indicated by an empty line) and acting accordingly. line.strip() removes leading and trailing whitespace from the line, so "\n".strip() returns "".
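If you prefer something more compact, here is a minimal alternative sketch that splits the whole file on blank lines instead of iterating line by line. It assumes the separator lines are truly empty (no stray spaces) and that every data line contains exactly one comma; 'file.txt' is the same assumed file name as above.

# Alternative sketch: split the file contents into records on blank lines,
# then unzip each record's "word,label" lines into words and labels.
with open('file.txt', 'r') as f:
    records = [rec for rec in f.read().split("\n\n") if rec.strip()]

Xs, Ys = [], []
for rec in records:
    # zip(*...) turns pairs like ("how", "SW"), ("is", "SW"), ... into
    # one tuple of words and one tuple of labels.
    words, labels = zip(*(line.split(",") for line in rec.strip().splitlines()))
    Xs.append(list(words))
    Ys.append(list(labels))

print(Xs)
print(Ys)

This produces the same two nested lists as the loop above; which version to use is mostly a matter of taste.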