r corpus quanteda sentence

How can you exclude certain words before periods from being used as sentence breaks in quanteda's corpus_reshape?


In some cases, corpus_reshape mistakenly treats certain periods as sentence breaks. I have a corpus from the pharmaceutical industry, and in many cases "Dr." is treated as the end of a sentence. This post (Quanteda's corpus_reshape function: how not to break sentences after abbreviations (like "e.g.")) is similar but unfortunately does not solve the problem. Here is an example:


    library("quanteda")
    
    txt <- c(
      d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
      d2 = "The U.S. is south of Canada."
    )
    # Nested call instead of %>%, so no pipe package needs to be loaded
    corpus_reshape(corpus(txt), to = "sentences")

Corpus consisting of 4 documents.

d1.1 : "With us we have Dr."

d1.2 : "Smith."

d1.3 : "We are not sure... where we stand."

d2.1 : "The U.S. is south of Canada."

Splitting works correctly in only a few of the cases involving "Dr.". Can certain words be passed to the function as exceptions? I would like to avoid switching to a different function for breaking the text into sentences. Thanks!


Solution

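  • Use corpus_segment() with a custom pattern and valuetype = "regex", so that you decide where the text is split.

    Examples are in the documentation:

    https://quanteda.io/reference/corpus_segment.html

    The use_docvars argument controls whether the document variables of the original texts are repeated for each segment.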
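
    As a rough illustration of this approach, here is a minimal sketch applied to the example from the question. The abbreviation list and the regular expression are my own assumptions, not something quanteda provides, so you will probably need to adapt them to your corpus. The idea is to split at whitespace that follows ".", "!" or "?" and precedes an uppercase letter, except when the period belongs to a listed abbreviation.

```r
library("quanteda")

txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)

# Hypothetical list of abbreviations that should never end a sentence
abbrevs <- c("Dr", "Mr", "Mrs", "Prof")

# One negative lookbehind per abbreviation, e.g. (?<!\bDr\.)
not_after_abbrev <- paste0("(?<!\\b", abbrevs, "\\.)", collapse = "")

# Split at whitespace that follows ., ! or ? and precedes an uppercase letter,
# unless the punctuation belongs to one of the abbreviations above
boundary <- paste0(not_after_abbrev, "(?<=[.!?])\\s+(?=\\p{Lu})")

sents <- corpus_segment(
  corpus(txt),
  pattern = boundary,
  valuetype = "regex",
  case_insensitive = FALSE,   # keep \p{Lu} restricted to uppercase letters
  extract_pattern = FALSE,    # do not move the matched whitespace into a docvar
  pattern_position = "after"  # the match marks the end of a segment
)

print(sents)
```

    With this pattern "Dr. Smith" should stay together, and "U.S. is" and "sure... where" are not split because the next word starts with a lowercase letter. The trade-off is that a sentence that really ends in one of the listed abbreviations, or a sentence that starts with a lowercase letter or a digit, will not be split either.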