I have a processed dataframe which is used as an input to train an NLP model:
sentence_id words labels
0 0 a B-ORG
1 0 b I-ORG
2 0 c I-ORG
5 1 d B-ORG
6 1 e I-ORG
7 2 f B-PER
8 2 g I-PER
I need to convert this into the CoNLL text format, as below:
a B-ORG
b I-ORG
c I-ORG

d B-ORG
e I-ORG

f B-PER
g I-PER
The CoNLL format is a text file with one word per line, with sentences separated by an empty line. The first token on each line should be the word and the last token should be the label.
Does anyone have an idea how to do that?
First join both columns with a space, then loop over DataFrame.groupby by sentence_id,
add an empty value to the end of each group, and write it to the file in append mode:
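import numpy as np
import pandas as pd

# Build one "word label" string per row.
df['join'] = df['words'] + ' ' + df['labels']
# Alternative:
# df['join'] = df['words'].str.cat(df['labels'], sep=' ')
# Series.append was removed in pandas 2.0, so pd.concat adds the trailing NaN row.
# mode='a' appends, so delete any existing file.txt before re-running.
for i, g in df.groupby('sentence_id')['join']:
    out = pd.concat([g, pd.Series([np.nan])])
    out.to_csv('file.txt', index=False, header=False, mode='a')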
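One caveat with the to_csv approach: the CSV writer may emit the lone empty row as "" instead of a truly blank line, and it will quote any word that contains a comma or a quote character. A minimal sketch that avoids the CSV machinery by building the text directly is below. The helper name to_conll and the rebuilt sample dataframe are assumptions for illustration, not part of the original answer.

```python
import pandas as pd


def to_conll(df: pd.DataFrame) -> str:
    """Return CoNLL-style text: one "word label" line per row,
    with a blank line between sentences."""
    sentences = []
    # sort=False keeps sentences in their order of first appearance.
    for _, group in df.groupby("sentence_id", sort=False):
        lines = [f"{word} {label}"
                 for word, label in zip(group["words"], group["labels"])]
        sentences.append("\n".join(lines))
    # A blank line separates sentences; the file ends with a single newline.
    return "\n\n".join(sentences) + "\n"


# Sample data mirroring the dataframe in the question.
df = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1, 2, 2],
    "words": list("abcdefg"),
    "labels": ["B-ORG", "I-ORG", "I-ORG", "B-ORG", "I-ORG", "B-PER", "I-PER"],
})

conll_text = to_conll(df)

# mode "w" overwrites, so re-running does not duplicate sentences.
with open("file.txt", "w", encoding="utf-8") as fh:
    fh.write(conll_text)
```

Because the file is opened in write mode rather than append mode, running the script twice produces the same file.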