I'm doing a standard topic modelling task on nouns in newspaper articles, using udpipe to annotate the article content. With udpipe_annotate() I noticed that a word and the punctuation mark following it are sometimes kept together as a single token labelled upos = NOUN. So when I run the topic model function, LDA() from library topicmodels, the most common words for a topic might include both 'product' and 'product.', the latter including the punctuation mark. These should be treated as the same word. How can I remedy this and remove the punctuation?
Another issue is that a word followed by a punctuation mark is sometimes labelled upos = PUNCT as a whole. E.g. 'energy' and 'energy,' get different labels. So I would have to include PUNCT in the analysis, and even then I run into the same problem as above: the algorithm treats them as two different words. Is this a problem with the udpipe annotation, or is there an easy fix?
EDIT: Adding a code example using the opening sentences of the Wikipedia article on Norway, in Norwegian:
library(udpipe)
text <- c('Norge, offisielt Kongeriket Norge, er et nordisk, europeisk land og en selvstendig stat vest på Den skandinaviske halvøy. Geografisk sett er landet langt og smalt.', 'På den langstrakte kysten mot Nord-Atlanteren befinner Norges vidkjente fjorder seg.', 'Kongeriket Norge omfatter hovedlandet (fastlandet med tilliggende øyer innenfor grunnlinjen), Jan Mayen og Svalbard.')
id <- 1:3
df <- data.frame(text, id)
# download and load the Norwegian Bokmål model
ud_model <- udpipe_download_model(language = "norwegian-bokmaal")
ud_model <- udpipe_load_model(ud_model$file_model)
# annotate and convert the result to a data frame
x <- udpipe_annotate(ud_model, x = df$text, doc_id = df$id)
x_df <- as.data.frame(x)
Showing example of the problematic outputs (the rest (ADJ, VERB, etc) are fine I think):
head(x_df[x_df$upos=='NOUN',5:8], 5)
OUTPUT:
token_id | token | lemma | upos |
---|---|---|---|
1 | Norge, | norge, | NOUN |
4 | Norge, | norge, | NOUN |
9 | land | land | NOUN |
13 | stat | stat | NOUN |
18 | halvøy. | halvøy. | NOUN |
The tokens with token_id 1, 4, and 18 are not correct: each includes the trailing punctuation mark.
head(x_df[x_df$upos=='PUNCT',5:8])
OUTPUT:
token_id | token | lemma | upos |
---|---|---|---|
7 | nordisk, | $nordisk, | PUNCT |
10 | grunnlinjen), | $grunnlinjen), | PUNCT |
Here, udpipe is finding the punctuation but it also includes the preceding word.
EDIT2: The problem does not occur for me with the French or English language models. Nor does it seem to appear on the norwegian-nynorsk version.
Looks like there is an issue with the norwegian-bokmaal UD 2.5 model. The UD treebank for Norwegian Bokmål is already at version 2.10.
It works correctly if you use either norwegian-nynorsk or the norwegian-bokmaal UD 2.4 model.
# switch to older model
ud_model <- udpipe_download_model(language = "norwegian-bokmaal",
udpipe_model_repo = "jwijffels/udpipe.models.ud.2.4")
# nynorsk works as well
ud_model <- udpipe_download_model(language = "norwegian-nynorsk")
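After switching models and re-annotating, a quick check can confirm that no tokens still carry punctuation at their edges. The sketch below is only an illustration: `find_fused_tokens` is a hypothetical helper (not part of udpipe), and the toy data frame merely mimics the `token` and `upos` columns of the output shown in the question.

```r
# Hypothetical helper: list tokens that contain letters but still start or
# end with a punctuation character. Assumes `x_df` has the `token` and `upos`
# columns of an annotation converted with as.data.frame().
find_fused_tokens <- function(x_df) {
  has_letter     <- grepl("[[:alpha:]]", x_df$token)
  has_edge_punct <- grepl("^[[:punct:]]|[[:punct:]]$", x_df$token)
  x_df[has_letter & has_edge_punct, c("token", "upos"), drop = FALSE]
}

# Toy data imitating the problematic output (not real model output)
toy <- data.frame(
  token = c("Norge,", "land", "grunnlinjen),", ",", "Nord-Atlanteren"),
  upos  = c("NOUN", "NOUN", "PUNCT", "PUNCT", "PROPN"),
  stringsAsFactors = FALSE
)

# Rows returned here are tokens where punctuation was not split off
find_fused_tokens(toy)
```

An empty result suggests the tokenizer is splitting punctuation properly. Legitimate abbreviations that end in a period would also be flagged, so treat the output as a list of candidates to inspect rather than a definitive error report.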
You can, of course, get version 2.10, but then you have to train your udpipe model yourself. More info about this in the Model Building vignette.
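If you have to stay on the 2.5 model for some reason, one possible workaround is to strip punctuation from the edges of the tokens and lemmas after annotation. This is a rough sketch, not a udpipe feature: `strip_edge_punct` is a hypothetical helper, and the toy data frame just imitates the bad rows from the question.

```r
# Hypothetical helper: remove punctuation (including the "$" seen in the bad
# lemmas) from the start and end of each string, leaving inner characters
# such as the hyphen in "Nord-Atlanteren" untouched.
strip_edge_punct <- function(x) {
  gsub("^[$[:punct:]]+|[$[:punct:]]+$", "", x)
}

# Toy data imitating the problematic rows (not real model output)
toy <- data.frame(
  token = c("Norge,", "land", "nordisk,", "grunnlinjen),"),
  lemma = c("norge,", "land", "$nordisk,", "$grunnlinjen),"),
  upos  = c("NOUN", "NOUN", "PUNCT", "PUNCT"),
  stringsAsFactors = FALSE
)

toy$token <- strip_edge_punct(toy$token)
toy$lemma <- strip_edge_punct(toy$lemma)

# Drop rows that were pure punctuation and are now empty
toy <- toy[nzchar(toy$lemma), ]
toy
```

Note that this only cleans the strings. Words that were wrongly tagged PUNCT keep that tag, so filtering on upos == 'NOUN' will still miss them. Switching to a model that tokenizes correctly is the more robust fix.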