I want to run a regex search using the quanteda and stringr libraries, but I keep getting errors. My goal is to match patterns of the form (VP (V.. ...) using the regex \(VP\h+\(V\w*\h+\w*\). Here is a MWE:
library(quanteda)
library(dplyr)
library(stringr)
text <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
kwic_regex <- kwic(
# define text
text,
# define search pattern
"\(VP\h+\(V\w*\h+\w*\)",
window = 20,
# define valuetype
valuetype = "regex") %>%
# make it a data frame
as.data.frame()
And this is the error message:
Error: '\(' is an unrecognized escape in character string starting ""\("
I find it puzzling because the regex should be correct (cf. https://regex101.com/r/3hbZ0R/1). I've also tried escaping the escapes (e.g., \\() to no avail. I would really appreciate any ideas on how to improve my query.
To get this to work, you have to understand how tokenisation works in quanteda and how pattern works with multi-token sequences.
First, kwic() matches the pattern against tokens, not against the raw string, and tokenisation (by default) removes the whitespace that you are including in your regex pattern. For your pattern, though, the whitespace is not the important part; the sequence of tokens is. The current default tokeniser will also split the parentheses from the POS tags and the text, so you want to control this by using a different tokeniser that splits on (and removes) whitespace only. See ?tokens and ?pattern.
Second, to match sequences of tokens, you need to wrap your multi-token pattern in phrase(), which will split it on whitespace. See ?phrase.
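To make the idea of sequence matching concrete, here is a rough base-R sketch of what matching a whitespace-split pattern against consecutive tokens amounts to. It is only an illustration under my own assumptions, not how quanteda implements phrase() or kwic(); the shortened example text and the names toks, pats, starts and hits are made up for this sketch.

```r
# Illustrative base-R sketch only; this is NOT quanteda's implementation.
txt <- "(S (VP (VBZ is) (RB not) (VP (VBN transmitted)"

# Split the text and the pattern on whitespace: one element per token.
toks <- strsplit(txt, "\\s+")[[1]]
pats <- strsplit("\\(VP \\(V \\)", " ", fixed = TRUE)[[1]]

# For every possible start position, check that each sub-pattern matches
# the token at the corresponding offset.
n <- length(pats)
starts <- seq_len(length(toks) - n + 1)
hits <- starts[vapply(starts, function(i) {
  all(mapply(grepl, pats, toks[i:(i + n - 1)]))
}, logical(1))]

# Print each matched token sequence.
for (i in hits) cat(toks[i:(i + n - 1)], "\n")
```

In quanteda itself, the tokeniser and phrase() take care of both splitting steps for you.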
So this will work (and very efficiently):
library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- "(ROOT (S (NP (PRP It)) (VP (VBZ is) (RB not) (VP (VBN transmitted) (PP (IN from) (: :) (S (VP (VBG giving) (NP (NP (NP (NP (NML (NN blood)"
toks <- tokens(txt, what = "fasterword", remove_separators = TRUE)
print(toks, -1, -1)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "(ROOT" "(S" "(NP" "(PRP" "It))"
#> [6] "(VP" "(VBZ" "is)" "(RB" "not)"
#> [11] "(VP" "(VBN" "transmitted)" "(PP" "(IN"
#> [16] "from)" "(:" ":)" "(S" "(VP"
#> [21] "(VBG" "giving)" "(NP" "(NP" "(NP"
#> [26] "(NP" "(NML" "(NN" "blood)"
kwic(toks, phrase("\\(VP \\(V \\)"), window = 3, valuetype = "regex")
#> Keyword-in-context with 3 matches.
#> [text1, 6:8] (NP (PRP It)) | (VP (VBZ is) | (RB not) (VP
#> [text1, 11:13] is) (RB not) | (VP (VBN transmitted) | (PP (IN from)
#> [text1, 20:22] (: :) (S | (VP (VBG giving) | (NP (NP (NP
Created on 2023-07-03 with reprex v2.0.2
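Note that you do need to double-escape the reserved characters in the regular expression pattern. R's string parser consumes one level of backslashes before the regex engine ever sees the pattern, which is why the single-escaped "\(" in your MWE fails with the "unrecognized escape" error.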
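As a small base-R illustration of the escaping point (separate from quanteda, and assuming R >= 4.0.0 for the raw-string syntax), the following shows what the regex engine actually receives. The variable names are just for this sketch, and the last line tests your original pattern against a plain string with PCRE, not through kwic().

```r
# The R parser consumes one level of backslashes in ordinary string literals,
# so the regex \( has to be typed as "\\(".
p <- "\\(VP"
nchar(p)        # counts the characters in the parsed string
cat(p, "\n")    # shows the pattern as the regex engine receives it

# R >= 4.0.0 also accepts raw strings, which need no doubling.
p_raw <- r"(\(VP)"
identical(p, p_raw)

# The original pattern, double-escaped, applied to a plain string.
# perl = TRUE is needed here because \h is PCRE syntax.
orig_pat <- "\\(VP\\h+\\(V\\w*\\h+\\w*\\)"
grepl(orig_pat, "(VP (VBZ is)", perl = TRUE)
```

This only demonstrates that the double-escaped pattern is valid as a regex; with kwic() you still need the token-based approach above, because the whitespace is gone after tokenisation.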