Tags: nlp, spacy, quanteda

Is there a way to keep between-word hyphens when lemmatizing using spacyr?


I'm using spacyr to lemmatise a corpus of speeches, and then using quanteda to tokenise and analyse the results (via textstat_frequency()). My issue is that some key terms in the texts are hyphenated. When I tokenise with quanteda alone, the between-word hyphens are preserved and each hyphenated term is treated as one token, which is my desired result. However, when I use spacyr to lemmatise first, hyphenated words are not kept together.

I've tried nounphrase_consolidate(), which does keep hyphenated words together, but I find the results very inconsistent: sometimes a term of interest is kept on its own during consolidation, and other times it is combined into a larger noun phrase. This is a problem because my final step applies a particular dictionary of features with textstat_frequency(), some of which are hyphenated terms.
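For illustration, the difference looks like this (a minimal sketch; the example sentence is my own, and the expected output is inferred from the behaviour described above):

    library("quanteda")
    library("spacyr")
    
    txt <- c(d1 = "NLP is fast-moving.")
    
    # quanteda keeps intra-word hyphens by default (split_hyphens = FALSE)
    tokens(txt)
    # expected tokens: "NLP" "is" "fast-moving" "."
    
    # spaCy's tokenizer splits on the hyphen, so the lemmas come back
    # as three separate tokens
    spacy_parse(txt, lemma = TRUE, entity = FALSE, pos = FALSE)$lemma
    # expected lemmas: "NLP" "be" "fast" "-" "moving" "."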

It seems like there is a solution for this in spaCy, but I was curious whether there's a similar option in spacyr: SpaCy -- intra-word hyphens. How to treat them one word?

Thanks for any thoughts or suggestions. My code is below; it makes no difference whether I use remove_punct or not when tokenising.

    # parse and lemmatise with spaCy
    test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE,
                           pos = FALSE, tag = FALSE, nounphrase = TRUE)
    # replace each token with its lemma
    test.sp$token <- test.sp$lemma
    # consolidate noun phrases, then convert to quanteda tokens
    test.np <- nounphrase_consolidate(test.sp)
    test.tokens.3 <- as.tokens(test.np)
    test.tokens.3 <- tokens(test.tokens.3, remove_symbols = TRUE,
                            remove_numbers = TRUE,
                            remove_punct = TRUE,
                            remove_url = TRUE) %>% 
      tokens_tolower() %>% 
      tokens_select(pattern = stopwords("en"), selection = "remove")

Solution

  • You should be able to rejoin the hyphenated words in quanteda, using tokens_compound(). The pattern phrase("* - *") matches any token, a lone hyphen, and any token in sequence, and concatenator = "" glues the three pieces back together without inserting a separator.

    library("quanteda")
    #> Package version: 3.3.1
    #> Unicode version: 14.0
    #> ICU version: 71.1
    #> Parallel computing: 10 of 10 threads used.
    #> See https://quanteda.io for tutorials and examples.
    library("spacyr")
    
    test.corpus <- c(d1 = "NLP is fast-moving.",
                     d2 = "A co-ordinated effort.")
    test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE, pos = FALSE, tag = FALSE, nounphrase = TRUE)
    #> Found 'spacy_condaenv'. spacyr will use this environment
    #> successfully initialized (spaCy Version: 3.4.4, language model: en_core_web_sm)
    #> (python options: type = "condaenv", value = "spacy_condaenv")
    test.sp$token <- test.sp$lemma
    test.np <- nounphrase_consolidate(test.sp)
    test.tokens.3 <- as.tokens(test.np)
    
    tokens_compound(test.tokens.3, pattern = phrase("* - *"), concatenator = "")
    #> Tokens consisting of 2 documents.
    #> d1 :
    #> [1] "NLP"         "be"          "fast-moving" "."          
    #> 
    #> d2 :
    #> [1] "a_co-ordinated_effort" "."
    

Created on 2023-06-09 with reprex v2.0.2
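From there, the rest of the pipeline in the question should run on the compounded tokens so that the hyphenated features survive into textstat_frequency(). Here is a sketch, assuming quanteda >= 3 (where textstat_frequency() lives in the quanteda.textstats package); the object name test.tokens.4 is mine:

    library("quanteda.textstats")
    
    test.tokens.4 <- tokens_compound(test.tokens.3, pattern = phrase("* - *"),
                                     concatenator = "") %>% 
      tokens(remove_symbols = TRUE, remove_numbers = TRUE,
             remove_punct = TRUE, remove_url = TRUE) %>% 
      tokens_tolower() %>% 
      tokens_remove(pattern = stopwords("en"))
    
    # hyphenated terms such as "fast-moving" now count as single features
    textstat_frequency(dfm(test.tokens.4))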