r regex escaping stringr quanteda

Quanteda and stringr in R: (Correct) regex cannot be parsed


I want to run a regex search using the quanteda and stringr libraries, but I keep getting errors. My goal is to match patterns of the form (VP (V.. ...) using the regex \(VP\h+\(V\w*\h+\w*\). Here is an MWE:

library(quanteda)
library(dplyr)
library(stringr)

text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"


kwic_regex <- kwic(
  # define text
  text, 
  # define search pattern
  "\(VP\h+\(V\w*\h+\w*\)", 
  window = 20, 
  # define valuetype
  valuetype = "regex") %>%
  # make it a data frame
  as.data.frame()

And this is the error message:

Error: '\(' is an unrecognized escape in character string starting ""\("

I find it puzzling because the regex should be correct (cf. https://regex101.com/r/3hbZ0R/1). I've also tried escaping the escapes (e.g., \\() to no avail. I would really appreciate any ideas on how to improve my query.


Solution

  • To get this to work, you need to understand how tokenisation works in quanteda and how a pattern is matched against multi-token sequences.

    First, tokenisation (by default) removes the whitespace that your regex pattern includes. For your purposes the whitespace itself does not matter; what matters is the sequence of tokens. The current default tokeniser also splits parentheses from the POS tags and the text, so you want a different tokeniser that splits on (and removes) whitespace only. See ?tokens and ?pattern.

    Second, to match sequences of tokens, you need to wrap your multi-token pattern in phrase(), which will split it on whitespace. See ?phrase.
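
    To make the two points above concrete, here is a small sketch that compares the default tokeniser with the whitespace-only one and shows what phrase() does to a pattern. The snippet string and the variable names are illustrative only; the exact printed form may differ between quanteda versions.

    ```r
    library(quanteda)

    # Illustrative fragment of a parse tree (not the full example text).
    snippet <- "(VP (VBZ is)"

    # Default tokeniser: punctuation such as "(" and ")" becomes separate tokens.
    toks_default <- tokens(snippet)

    # "fasterword" splits on whitespace only, so each tag keeps its parenthesis.
    toks_ws <- tokens(snippet, what = "fasterword", remove_separators = TRUE)

    print(as.character(toks_default))
    print(as.character(toks_ws))

    # phrase() splits a pattern on whitespace into one element per token,
    # so each element is matched against one token in sequence.
    pat <- phrase("\\(VP \\(V \\)")
    print(pat)
    ```

    With whitespace-only tokens, each of the three pattern elements lines up with exactly one token, which is why the sequence can be matched without any whitespace in the regex.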

    So this will work (and very efficiently):

    library("quanteda")
    #> Package version: 3.3.1
    #> Unicode version: 14.0
    #> ICU version: 71.1
    #> Parallel computing: 12 of 12 threads used.
    #> See https://quanteda.io for tutorials and examples.
    
    txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
    
    toks <- tokens(txt, what = "fasterword", remove_separators = TRUE)
    print(toks, -1, -1)
    #> Tokens consisting of 1 document.
    #> text1 :
    #>  [1] "(ROOT"        "(S"           "(NP"          "(PRP"         "It))"        
    #>  [6] "(VP"          "(VBZ"         "is)"          "(RB"          "not)"        
    #> [11] "(VP"          "(VBN"         "transmitted)" "(PP"          "(IN"         
    #> [16] "from)"        "(:"           ":)"           "(S"           "(VP"         
    #> [21] "(VBG"         "giving)"      "(NP"          "(NP"          "(NP"         
    #> [26] "(NP"          "(NML"         "(NN"          "blood)"
    
    kwic(toks, phrase("\\(VP \\(V \\)"), window = 3, valuetype = "regex")
    #> Keyword-in-context with 3 matches.                                                                     
    #>    [text1, 6:8] (NP (PRP It)) |     (VP (VBZ is)      | (RB not) (VP 
    #>  [text1, 11:13]  is) (RB not) | (VP (VBN transmitted) | (PP (IN from)
    #>  [text1, 20:22]       (::) (S |   (VP (VBG giving)    | (NP (NP (NP
    

    Created on 2023-07-03 with reprex v2.0.2

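    Note that you do need to double-escape the reserved characters in the regular expression pattern: inside an R string literal the backslash is itself an escape character, so \( has to be written as \\(. A single backslash is what triggers the "unrecognized escape" error in the question.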
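
    As a side illustration of the escaping rule, the following base R sketch writes the regex from the question in two equivalent ways and applies it to the untokenised string. It assumes R 4.0 or later for the raw-string syntax, and it is not a replacement for kwic(), since it works on the raw text rather than on tokens. The shortened txt and the variable names are illustrative only.

    ```r
    # Base R only; no packages assumed.
    txt <- "(S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from)"

    # A single backslash before "(" is rejected by the R parser,
    # so the backslash itself must be escaped in a normal string literal.
    pat_escaped <- "\\(VP\\h+\\(V\\w*\\h+\\w*\\)"

    # R >= 4.0 raw strings avoid the doubling; r"[...]" uses brackets as delimiters.
    pat_raw <- r"[\(VP\h+\(V\w*\h+\w*\)]"

    # Both spellings produce the same regex string.
    print(identical(pat_escaped, pat_raw))

    # perl = TRUE is used because \h (horizontal whitespace) is a PCRE class.
    hits <- regmatches(txt, gregexpr(pat_escaped, txt, perl = TRUE))[[1]]
    print(hits)
    ```

    This suggests the regex itself is fine once it is escaped for R; the remaining difficulty in the question comes from kwic() matching against tokens rather than against the whole string.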
