I have some data in a .csv file that looks like this:
import pandas as pd

sent_num = [0, 1, 2]
text = [['Jack', 'in', 'the', 'box'], ['Jack', 'in', 'the', 'box'], ['Jack', 'in', 'the', 'box']]
tags = [['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG'], ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG'], ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']]
df = pd.DataFrame(zip(sent_num, text, tags), columns=['sent_num', 'text', 'tags'])
df
I want to transform that data into a CoNLL-format text file like the one below, where the two columns (text and tags) are separated by a tab, and the end of each sentence (or document) is indicated by a blank line.
text tags
Jack B-ORG
in I-ORG
the I-ORG
box I-ORG

Jack B-ORG
in I-ORG
the I-ORG
box I-ORG

Jack B-ORG
in I-ORG
the I-ORG
box I-ORG
Here is what I have tried. It does not work: the empty rows are counted as valid data instead of marking the end of a sentence.
# create a three-column dataset
DF = df.apply(pd.Series.explode)
DF.head()
# insert space between rows in the data frame
# find the indices where changes occur
switch = DF['sent_num'].ne(DF['sent_num'].shift(-1))
# construct a new empty dataframe and shift index by .1
DF1 = pd.DataFrame('', index=switch.index[switch] + .1, columns=DF.columns)
# concatenate old and new dataframes and sort by index, reset index and remove row positions by iloc
DF2 = pd.concat([DF, DF1]).sort_index().reset_index(drop=True).iloc[:-1]
DF2.head()
# group by tags and count the rows per tag
DF2[['text', 'tags']].groupby('tags').count()
I am looking for some help in modifying or improving the code I have.
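with open("output.txt", "w") as f_out:
    # header line
    print("text\ttags", file=f_out)
    for _, line in df.iterrows():
        # one token and its tag per line, separated by a tab
        for txt, tag in zip(line["text"], line["tags"]):
            print("{}\t{}".format(txt, tag), file=f_out)
        # blank line marks the end of the sentence
        print(file=f_out)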
This creates output.txt:
text tags
Jack B-ORG
in I-ORG
the I-ORG
box I-ORG

Jack B-ORG
in I-ORG
the I-ORG
box I-ORG

Jack B-ORG
in I-ORG
the I-ORG
box I-ORG
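To check that the blank lines really act as sentence boundaries and are not counted as data, one option is to read the file back and split on them. The following is only a sketch under a few assumptions: the helper names `write_conll` and `read_conll` and the file name `output_check.txt` are made up for illustration, and the reader expects the same header line and tab separator as the code above.

```python
import pandas as pd


def write_conll(df, path):
    """Write one token and tag per line, with a blank line after each sentence."""
    with open(path, "w", encoding="utf-8") as f_out:
        f_out.write("text\ttags\n")  # header line
        for tokens, tags in zip(df["text"], df["tags"]):
            for token, tag in zip(tokens, tags):
                f_out.write(f"{token}\t{tag}\n")
            f_out.write("\n")  # blank line ends the sentence


def read_conll(path):
    """Read the file back into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f_in:
        next(f_in)  # skip the header line
        for line in f_in:
            line = line.rstrip("\n")
            if not line:  # blank line: sentence boundary, not data
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, tag = line.split("\t")
            current.append((token, tag))
    if current:  # last sentence if the file does not end with a blank line
        sentences.append(current)
    return sentences


# Same sample data as in the question
df = pd.DataFrame({
    "sent_num": [0, 1, 2],
    "text": [["Jack", "in", "the", "box"]] * 3,
    "tags": [["B-ORG", "I-ORG", "I-ORG", "I-ORG"]] * 3,
})

write_conll(df, "output_check.txt")
sentences = read_conll("output_check.txt")
print(len(sentences))  # number of sentences recovered from the file
```

Iterating over `zip(df["text"], df["tags"])` does the same job as `df.iterrows()` here, without building a Series for every row. If tag counts are needed, they can be taken from the parsed `sentences`, so the separator lines never show up as an empty tag.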