Tags: r, tm, udpipe

Using content_transformer with udpipe_annotate


So I just found out that udpipe has an awesome way of showing correlations, and I started working with it. The code from this site works perfectly if I use it on the csv file right after importing it and don't make any changes to it.

But my problem occurs as soon as I create a corpus and change/remove some words. I'm no expert in R, and I've googled a lot but can't seem to figure it out.

Here is my code:

library(readr)
library(tm)

txt <- read_delim(fileName, ";", escape_double = FALSE, trim_ws = TRUE)

# Create corpus
docs <- Corpus(VectorSource(txt))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, stopwords('nl'))
docs <- tm_map(docs, removeWords, myWords())
docs <- tm_map(docs, content_transformer(gsub), pattern = "afspraak|afspraken|afgesproken", replacement = "afspraak")
docs <- tm_map(docs, content_transformer(gsub), pattern = "communcatie|communiceren|communicatie|comminicatie|communiceer|comuniseren|comunuseren|communictatie|comminiceren|comminisarisacie|communcaite", replacement = "communicatie")
docs <- tm_map(docs, content_transformer(gsub), pattern = "contact|kontact|kontakt", replacement = "contact")

comments <- docs

library(lattice)
stats <- txt_freq(x$upos)
stats$key <- factor(stats$key, levels = rev(stats$key))
#barchart(key ~ freq, data = stats, col = "cadetblue", main = "UPOS (Universal Parts of Speech)\n frequency of occurrence", xlab = "Freq")

## NOUNS
stats <- subset(x, upos %in% c("NOUN")) 
stats <- txt_freq(stats$token)
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 20), col = "cadetblue", main = "Most occurring nouns", xlab = "Freq")

## ADJECTIVES
stats <- subset(x, upos %in% c("ADJ")) 
stats <- txt_freq(stats$token)
stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 20), col = "cadetblue", main = "Most occurring adjectives", xlab = "Freq")

## Using RAKE
stats <- keywords_rake(x = x, term = "lemma", group = "doc_id", relevant = x$upos %in% c("NOUN", "ADJ"))
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
barchart(key ~ rake, data = head(subset(stats, freq > 3), 20), col = "cadetblue", main = "Keywords identified by RAKE", xlab = "Rake")

## Using Pointwise Mutual Information Collocations
x$word <- tolower(x$token)
stats <- keywords_collocation(x = x, term = "word", group = "doc_id")
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
barchart(key ~ pmi, data = head(subset(stats, freq > 3), 20), col = "cadetblue", main = "Keywords identified by PMI Collocation", xlab = "PMI (Pointwise Mutual Information)")

## Using a sequence of POS tags (noun phrases / verb phrases)
x$phrase_tag <- as_phrasemachine(x$upos, type = "upos")
stats <- keywords_phrases(x = x$phrase_tag, term = tolower(x$token), pattern = "(A|N)*N(P+D*(A|N)*N)*", is_regex = TRUE, detailed = FALSE)
stats <- subset(stats, ngram > 1 & freq > 3)
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
barchart(key ~ freq, data = head(stats, 20), col = "cadetblue", main = "Keywords - simple noun phrases", xlab = "Frequency")


cooc <- cooccurrence(x = subset(x, upos %in% c("NOUN", "ADJ")),
                     term = "lemma",
                     group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooc)
library(igraph)
library(ggraph)
library(ggplot2)
wordnetwork <- head(cooc, 30)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
    geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
    geom_node_text(aes(label = name), col = "darkgreen", size = 4) +
    theme_graph(base_family = "Arial Narrow") +
    theme(legend.position = "none") +
    labs(title = "Cooccurrences within sentence", subtitle = "Nouns & Adjectives")

As soon as I convert the imported file to a corpus, it fails. Does anyone know how I can still execute the tm_map functions and then run the udpipe code?

Thanks in advance!


Solution

  • There are multiple solutions to what you want, but since your corpus is created with VectorSource, it is just one long vector of inputs. You can very easily get this back into a character vector so udpipe can take over.

    In the udpipe example documents everything is defined as x, so I will do the same. After cleaning your corpus, just do:

    x <- as.character(docs[1])
    

    The [1] after docs is important; otherwise you get some additional characters you don't need. Once this is done, run the udpipe commands to turn the vector into the data.frame you need.

    x <- udpipe_annotate(ud_model, x)
    x <- as.data.frame(x)
    
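    The code above assumes ud_model is already loaded. In case it is not, a minimal sketch for getting the Dutch model first (udpipe_download_model stores the model file in the working directory and reports its path):

    library(udpipe)
    dl <- udpipe_download_model(language = "dutch")     # download once
    ud_model <- udpipe_load_model(file = dl$file_model) # then load from the reported path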

    Another way is to first write the corpus to disk (check ?writeCorpus for more info), then read the cleaned file(s) back in and put them through udpipe. This is more of a workaround, but it might result in a better workflow.
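
    A rough sketch of that workaround, assuming the default writeCorpus behaviour of writing one .txt file per document id:

    writeCorpus(docs, path = ".")
    files <- list.files(pattern = "\\.txt$")
    cleaned <- sapply(files, function(f) paste(readLines(f), collapse = " "))
    x <- udpipe_annotate(ud_model, x = cleaned)
    x <- as.data.frame(x)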

    Also, udpipe handles punctuation itself: it puts it in a special upos class called PUNCT, with an xpos description (in Dutch, if you use the Dutch model) of Punc|komma or Punc|punt. If a noun has a capital letter, the lemma will be lowercase.
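
    So instead of stripping punctuation beforehand, you can simply drop those rows from the annotated data.frame afterwards:

    table(x$upos)                     # inspect which upos classes were assigned
    x <- subset(x, upos != "PUNCT")   # drop the punctuation rows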

    In your case I would just use the basic regex options to go through the data instead of using tm; see the sketch below. The Dutch stopword list just removes some verbs like "zijn", "worden" and "kunnen", some adpositions like "te", and pronouns like "ik" and "we". You filter these out in your udpipe code anyway, since you only look at nouns and adjectives.
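
    A sketch of that tm-free route, applying your replacements directly to the imported text before annotating (the column name text is an assumption; use whatever column read_delim actually gives you):

    comments <- tolower(txt$text)
    comments <- gsub("afspraken|afgesproken", "afspraak", comments)
    comments <- gsub("kontact|kontakt", "contact", comments)
    x <- udpipe_annotate(ud_model, x = comments)
    x <- as.data.frame(x)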