r nlp annotations punctuation udpipe

udpipe_annotate() in R labels the same word differently if followed by punctuation


I'm doing a standard topic modelling task on nouns in newspaper articles, using udpipe to annotate the article content. With udpipe_annotate() I noticed that a word and the punctuation mark following it are sometimes kept together as a single token labelled upos = NOUN. As a result, when I run the topic model with LDA() from the topicmodels package, the most common words for a topic might include both 'product' and 'product.', the latter including the punctuation mark. They should be treated as the same word. How can I remedy this and remove the punctuation?

Another issue is that a word followed by punctuation is sometimes labelled upos = PUNCT. For example, 'energy' and 'energy,' are labelled differently. To keep those words I would have to include PUNCT in the analysis, and even then I run into the same problem as above: the algorithm treats them as two different words. Is this a problem with the udpipe annotation, or is there an easy fix?

EDIT: Adding a code example using the opening sentences of the Norwegian Wikipedia article on Norway:

library(udpipe)

text <- c('Norge, offisielt Kongeriket Norge, er et nordisk, europeisk land og en selvstendig stat vest på Den skandinaviske halvøy. Geografisk sett er landet langt og smalt.', 'På den langstrakte kysten mot Nord-Atlanteren befinner Norges vidkjente fjorder seg.', 'Kongeriket Norge omfatter hovedlandet (fastlandet med tilliggende øyer innenfor grunnlinjen), Jan Mayen og Svalbard.')
id <- 1:3

# Keep the text as character rather than factor (matters on R < 4.0)
df <- data.frame(text, id, stringsAsFactors = FALSE)

# Download and load the Norwegian Bokmål model, then annotate the text
ud_model <- udpipe_download_model(language = "norwegian-bokmaal")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = df$text, doc_id = df$id)
x_df <- as.data.frame(x)

Examples of the problematic output (the other tags, such as ADJ and VERB, look fine, I think):

head(x_df[x_df$upos == "NOUN", c("token_id", "token", "lemma", "upos")], 5)

OUTPUT:

token_id  token    lemma    upos
1         Norge,   norge,   NOUN
4         Norge,   norge,   NOUN
9         land     land     NOUN
13        stat     stat     NOUN
18        halvøy.  halvøy.  NOUN

The tokens with token_id 1, 4, and 18 are not correct: the punctuation mark is still attached to the word.

head(x_df[x_df$upos == "PUNCT", c("token_id", "token", "lemma", "upos")])

OUTPUT:

token_id  token          lemma           upos
7         nordisk,       $nordisk,       PUNCT
10        grunnlinjen),  $grunnlinjen),  PUNCT

Here, udpipe finds the punctuation, but the token also includes the preceding word.

EDIT2: The problem does not occur for me with the French or English language models, nor does it seem to occur with the norwegian-nynorsk model.
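One way to compare models is to count how many tokens still have punctuation attached. The snippet below is only a sketch: has_glued_punct() is a hypothetical helper (not part of udpipe), and the token vector is hand-built to mimic the output above so that it runs without downloading a model.

```r
# Hypothetical helper, not part of udpipe: TRUE for tokens that contain a
# letter or digit and still end in a punctuation character.
has_glued_punct <- function(token) {
  grepl("[[:alnum:]].*[[:punct:]]$", token)
}

# Hand-built tokens standing in for x_df$token
tokens <- c("Norge", ",", "nordisk,", "land", "grunnlinjen),", ".")

glued <- has_glued_punct(tokens)
print(glued)
# [1] FALSE FALSE  TRUE FALSE  TRUE FALSE
```

On real output, mean(has_glued_punct(x_df$token)) gives the share of affected tokens for a given model. Abbreviations that legitimately end in a period would also be flagged, so treat the number as a rough indicator only.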


Solution

  • It looks like there is an issue with the norwegian-bokmaal UD 2.5 model. The UD treebank for Norwegian Bokmål is already at version 2.10, so the 2.5 model is several releases behind.

    It works correctly if you use either the norwegian-nynorsk model or the norwegian-bokmaal UD 2.4 model.

    # switch to older model
    ud_model <- udpipe_download_model(language = "norwegian-bokmaal", 
                                      udpipe_model_repo = "jwijffels/udpipe.models.ud.2.4")
    
    # nynorsk works as well
    ud_model <- udpipe_download_model(language = "norwegian-nynorsk")
    

    You can, of course, get version 2.10, but then you have to train your udpipe model yourself. More info about this in the Model Building vignette.
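If switching models is not an option, a possible workaround is to strip the attached punctuation from the token and lemma columns after annotation. This is a minimal sketch under some assumptions: strip_edge_punct() is a hypothetical helper (not a udpipe function), and the data frame is a hand-built stand-in for the annotation output from the question. It does not repair the upos column, so a word mislabelled as PUNCT stays mislabelled.

```r
# Hypothetical helper, not part of udpipe: removes punctuation at the start
# or end of each string and leaves inner characters (e.g. hyphens) alone.
strip_edge_punct <- function(x) {
  gsub("^[[:punct:]]+|[[:punct:]]+$", "", x)
}

# Hand-built stand-in for as.data.frame(udpipe_annotate(...))
x_df <- data.frame(
  token_id = c("1", "7", "9", "10", "11"),
  token    = c("Norge,", "nordisk,", "land", "grunnlinjen),", "."),
  lemma    = c("norge,", "$nordisk,", "land", "$grunnlinjen),", "."),
  upos     = c("NOUN", "PUNCT", "NOUN", "PUNCT", "PUNCT"),
  stringsAsFactors = FALSE
)

x_df$token <- strip_edge_punct(x_df$token)
x_df$lemma <- strip_edge_punct(x_df$lemma)

# Tokens that consisted only of punctuation are now empty strings; drop them
x_df <- x_df[nzchar(x_df$lemma), ]

print(x_df$lemma)
# [1] "norge"       "nordisk"     "land"        "grunnlinjen"
```

After this step 'product' and 'product.' would collapse into one term. Because the tags on the affected rows may still be wrong, filtering on upos == "NOUN" can miss words, which is why using a model that tokenizes correctly is the cleaner fix.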