python pandas nlp spacy conll

Problem with for loop, break statement does not do what I thought it would


This is my first time posting here, so be gentle, please.

I have written the following code:

import pandas as pd
import spacy

df = pd.read_csv('../../../Data/conll2003.dev.conll', sep='\t', on_bad_lines='skip', header=None)

nlp = spacy.load('en_core_web_sm')
nlp.max_length = 1500000
## https://stackoverflow.com/questions/48169545/does-spacy-take-as-input-a-list-of-tokens

all_tokens = []

for token in df[0]:
    all_tokens.append(str(token))

string = ' '.join(all_tokens)

doc = nlp(string)

token_tuples = tuple(enumerate(doc))

outfile = open('./conll2003.dev.syntax_corrupt.conll', 'w')
i = 0 ## initiate by looking at the first token in the doc
for x, token in enumerate(df[0]):
    for num, tok in token_tuples[i:]: ## we add this step to ensure that the for loop always looks from the last token that was a match, since doc is longer
        ## than df[0], otherwise it would at some point start looking from earlier tokens since spacy has more tokens and if there is an accidental match, it
        ## would provide the wrong dep and head
        if token == tok.text:
            i = num ## get the number from the token tuples as new starting point
            outfile.write(str(df[0][x]) + '\t' + str(df[1][x]) + '\t' + str(df[2][x]) + '\t' + str(df[3][x]) + '\t' + str(tok.dep_) + '\t' + str(tok.head.text) + '\n')
            break
        else:
            outfile.write(str(df[0][x]) + '\t' + str(df[1][x]) + '\t' + str(df[2][x]) + '\t' + str(df[3][x]) + '\t' + 'no_dep' + '\t' + 'no_head' + '\n')
            break
outfile.close()

The code is supposed to take data from the CoNLL-2003 shared task on NER, join the individual tokens into a single string (the data comes pre-tokenized), and feed that string into spaCy in order to make use of its dependency parsing. After that, I want to write the same lines that were in the original file plus two new columns containing the dependency relation and the respective head noun.

spaCy tokenizes the text differently from the pre-tokenized input, so len(doc) != len(df[0]), and I had to find a way to attribute each relation to the correct token.

It works fine if I do not include the else statement and it writes the correct relation with the token to the outfile. However, when I do include it, I would expect it to print one line with the values "no_dep" and "no_head" (for the token spaCy did not take into account) and then continue printing the tokens where there is information on the dependency relations (because the break statement should break the loop, yeah?). But it does not. It writes to every following token "no_dep" and "no_head" instead of going back to writing the actual relations.

In other words:

inputfile (snippet):

LONDON  NNP B-NP    B-LOC
1996-08-30  CD  I-NP    O

West    NNP B-NP    B-MISC

outputfile without else statement:

LONDON  NNP B-NP    B-LOC   nmod    Simmons
West    NNP B-NP    B-MISC  nmod    Indian

what I want with the else statement:

LONDON  NNP B-NP    B-LOC   nmod    Simmons
1996-08-30  CD  I-NP    O   no_dep  no_head
West    NNP B-NP    B-MISC  nmod    Indian

what I get:

LONDON  NNP B-NP    B-LOC   no_dep  no_head
1996-08-30  CD  I-NP    O   no_dep  no_head
West    NNP B-NP    B-MISC  no_dep  no_head

(Note that the first line in the outputfile does have the correct dependency relation and head noun, the problem starts from the second line.)

Any ideas what it is that I'm doing wrong? Thanks!


Solution
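
  • On the `break` behaviour itself: both branches of the `if`/`else` end in `break`, so the inner loop only ever inspects its first candidate, `token_tuples[i]`. Because `i = num` does not move past the matched token, `i` stays at 0 after the first row, and every later token is compared with the first spaCy token again. The sketch below reproduces this on plain lists. The function names and the split of `1996-08-30` are made up for illustration and are not spaCy's actual output.

```python
def align_first_only(source, parsed):
    """Mimics the question's loop: `break` in both the if and the else branch."""
    pairs = list(enumerate(parsed))
    out, i = [], 0
    for token in source:
        for num, tok in pairs[i:]:
            if token == tok:
                i = num  # stays on the matched token, as in the question
                out.append((token, "match"))
                break
            else:
                # Runs as soon as the FIRST candidate differs, then stops looking.
                out.append((token, "no_dep"))
                break
    return out


def align_scan(source, parsed):
    """Scans forward; the loop's `else` runs only if no candidate matched."""
    out, i = [], 0
    for token in source:
        for num in range(i, len(parsed)):
            if token == parsed[num]:
                i = num + 1  # continue after the matched token
                out.append((token, "match"))
                break
        else:
            out.append((token, "no_dep"))
    return out


source = ["LONDON", "1996-08-30", "West"]
# Hypothetical re-tokenization, only to create a mismatch:
parsed = ["LONDON", "1996", "-", "08", "-", "30", "West"]

print(align_first_only(source, parsed))
# [('LONDON', 'match'), ('1996-08-30', 'no_dep'), ('West', 'no_dep')]
print(align_scan(source, parsed))
# [('LONDON', 'match'), ('1996-08-30', 'no_dep'), ('West', 'match')]
```

    Forward scanning like this can still latch onto an identical token further along in the text, which is one reason to avoid re-aligning two tokenizations at all.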

  • You should preserve the original tokenization. To do this, manually create the Doc in order to skip the tokenizer in the pipeline:

    import spacy
    from spacy.tokens import Doc
    
    nlp = spacy.load("en_core_web_sm")  # the model used in the question
    words = ["here", "are", "the", "original", "tokens"]
    doc = Doc(nlp.vocab, words=words)
    
    # apply the model to the doc (it skips the tokenizer for an input `Doc`)
    doc = nlp(doc)
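
Applied to the file from the question, this could look roughly like the sketch below. With the tokenization preserved, the doc has exactly one token per input row, so no alignment loop and no `no_dep` fallback are needed. The helper names `parse_pretokenized` and `add_columns` are made up for this example, and it assumes a spaCy version in which `nlp` accepts a `Doc`, as in the snippet above.

```python
from typing import List, Sequence, Tuple


def parse_pretokenized(words: Sequence[str],
                       model: str = "en_core_web_sm") -> List[Tuple[str, str]]:
    """Return one (dependency label, head text) pair per input word."""
    # Imported lazily so add_columns() can be used without spaCy installed.
    import spacy
    from spacy.tokens import Doc

    nlp = spacy.load(model)
    # Building the Doc by hand keeps the given tokens; nlp() then runs
    # the remaining pipeline components on it.
    doc = nlp(Doc(nlp.vocab, words=list(words)))
    return [(tok.dep_, tok.head.text) for tok in doc]


def add_columns(rows: Sequence[Sequence[object]],
                parses: Sequence[Tuple[str, str]]) -> List[str]:
    """Append the dep and head columns to each row; returns tab-separated lines."""
    if len(rows) != len(parses):
        raise ValueError("expected exactly one parse per row")
    return ["\t".join([*map(str, row), dep, head])
            for row, (dep, head) in zip(rows, parses)]


# Usage with the DataFrame from the question (not executed here):
#   rows = list(df.itertuples(index=False, name=None))
#   parses = parse_pretokenized([str(row[0]) for row in rows])
#   with open('./conll2003.dev.syntax_corrupt.conll', 'w') as outfile:
#       outfile.write("\n".join(add_columns(rows, parses)) + "\n")
```

Note that this still parses the whole file as a single document, as the original code does. Parsing sentence by sentence may give better dependencies, but pandas drops the blank lines that separate sentences, so that would need a different way of reading the file.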