I used the spaCy library to generate dependency parses and save them in CoNLL format using the code below.
import pandas as pd
import spacy

df1 = pd.read_csv('cleantweets', encoding='latin1')
df1['tweet'] = df1['tweet'].astype(str)
tweet_list = df1['tweet'].values.tolist()
nlp = spacy.load("en_core_web_sm")

for tweet in tweet_list:
    doc = nlp(tweet)
    for sent in doc.sents:
        print('\n')
        for i, word in enumerate(sent):
            if word.head is word:
                head_idx = 0  # the root points to itself; use 0 as its head
            else:
                # head index relative to the start of the sentence
                head_idx = word.head.i - sent.start + 1
            print("%d\t%s\t%d\t%s\t%s\t%s" % (
                i + 1,
                word.head,
                head_idx,
                word.text,
                word.dep_,
                word.pos_,
            ))
This works, but some sentences in my dataset get split in two by spaCy because they have two ROOTs. This results in two fields for one sentence in the CoNLL output.
Example: a random sentence from my dataset is: "teanna trump probably cleaner twitter hoe but"
In CoNLL format it is saved as:
1 trump 2 teanna compound
2 cleaner 4 trump nsubj
3 cleaner 4 probably advmod
4 cleaner 4 cleaner ROOT
5 hoe 6 twitter amod
6 cleaner 4 hoe dobj
1 but 2 but ROOT
Is there a way to save it all in one field instead of two, even though it has two ROOTs, so that 'but' becomes the 7th item in field number 1? It would then look like this instead:
1 trump 2 teanna compound
2 cleaner 4 trump nsubj
3 cleaner 4 probably advmod
4 cleaner 4 cleaner ROOT
5 hoe 6 twitter amod
6 cleaner 4 hoe dobj
7 but 2 but ROOT
I'd recommend using (or adapting) the textacy CoNLL exporter to get the right format; see: How to generate .conllu from a Doc object?
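If you'd rather not add a dependency, here's a minimal hand-rolled sketch of the same idea (my own illustration of CoNLL-style numbering, not textacy's actual API, and the column layout — ID, FORM, HEAD, DEPREL — is my assumption): token IDs restart at 1 within each sentence, HEAD is also sentence-relative, and the root gets 0:

import spacy

nlp = spacy.load("en_core_web_sm")

def sent_to_conll(sent):
    # Illustration only, not textacy's API. Assumed columns: ID, FORM, HEAD, DEPREL.
    # IDs and HEADs restart at 1 within each sentence; the root's head is 0.
    lines = []
    for i, word in enumerate(sent, start=1):
        head = 0 if word.head is word else word.head.i - sent.start + 1
        lines.append("%d\t%s\t%d\t%s" % (i, word.text, head, word.dep_))
    return "\n".join(lines)

doc = nlp("teanna trump probably cleaner twitter hoe but")
for sent in doc.sents:
    print(sent_to_conll(sent), end="\n\n")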
spaCy's parser is doing sentence segmentation and you're iterating over doc.sents, so you'll see each sentence exported separately. If you want to provide your own sentence segmentation, you can do that with a custom component, e.g.:
def set_custom_boundaries(doc):
    # treat the token after "..." as the start of a new sentence
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
Details (especially about how to handle None vs. False vs. True): https://spacy.io/usage/linguistic-features#sbd-custom
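Here the goal is the opposite of the example above — keeping each tweet together rather than splitting it — so the same mechanism can be used to suppress boundaries instead. A minimal sketch, using the same v2-style add_pipe as above (in spaCy v3 the function would need to be registered with @Language.component):

import spacy

def prevent_sentence_boundaries(doc):
    # Mark every non-initial token as "not a sentence start" before the
    # parser runs; the parser respects preset values, so each tweet is
    # parsed as a single sentence and doc.sents yields one block per tweet.
    for token in doc[1:]:
        token.is_sent_start = False
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(prevent_sentence_boundaries, before="parser")

doc = nlp("teanna trump probably cleaner twitter hoe but")
print(len(list(doc.sents)))  # expect 1: the whole tweet is one sentence

Note that the tree itself may change too: with a single sentence the parser builds one tree over the whole tweet, so 'but' will attach to an existing head rather than being a second ROOT.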
spaCy's default models aren't trained on Twitter-like text, so the parser probably won't perform well with respect to sentence boundaries here.
(Please ask unrelated questions as separate questions, and also take a look at spaCy's docs: https://spacy.io/usage/linguistic-features#special-cases)