rdata.tablegrepl

Search for pattern in datatable with grepl


I have a datatable, which looks like this:

test <- data.table(
        ID = 1:5,
        pattern = c("01234567 ~ 01234567 ~ 12345678 ~ 23456789 ~ A012345",
                    "A012345 ~ 12345678 ~ A01234 ~ 12345678 ~ 23456789 ~ A012345",
                    "A012345 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789",
                    "12345678 ~ 01234567 ~ 12345678 ~ 23456789",
                    "A0123456 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789")
        )

I want to know if the rows have a pattern, starting with a character (A-Z) and have less than 7 or more than 7 digits. For this, I want to add two new columns, named smaller7 and greater7, which say TRUE or FALSE.

I tried with this:

      # less than 7 digits
      test[, smaller7 := grepl("(^| )[A-Z].{1,5}($| )", test$pattern, ignore.case = TRUE)]
      
      # more than 7 digits
      test[, greater7 := grepl("(^| )[A-Z].{7,}($| )$", test$pattern, ignore.case = TRUE)]

The result is:

ID pattern smaller7 greater7
1 01234567 ~ 01234567 ~ 12345678 ~ 23456789 ~ A012345 FALSE FALSE
2 A012345 ~ 12345678 ~ A01234 ~ 12345678 ~ 23456789 ~ A012345 TRUE TRUE
3 A012345 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 FALSE TRUE
4 12345678 ~ 01234567 ~ 12345678 ~ 23456789 FALSE FALSE
5 A0123456 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 FALSE TRUE

But the right output should be this:

ID pattern smaller7 greater7
1 01234567 ~ 01234567 ~ 12345678 ~ 23456789 ~ A012345 FALSE FALSE
2 A012345 ~ 12345678 ~ A01234 ~ 12345678 ~ 23456789 ~ A012345 TRUE FALSE
3 A012345 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 FALSE FALSE
4 12345678 ~ 01234567 ~ 12345678 ~ 23456789 FALSE FALSE
5 A0123456 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 FALSE TRUE

Whats wrong with my grepl? I need a solution without splitting column. Thanks for your help!


Solution

  • I think you are just missing a few brackets here and there. I also tried to be a bit more explicit with the digits.

    # less than 7 digits
    test[, smaller7 := grepl("(^| )[[:alpha:]]+([[:alnum:]]{1,5})($| )", pattern),]
    # more than 7 digits
    test[, greater7 := grepl("(^| )[[:alpha:]]+([[:alnum:]]{7,})($| )", pattern),]
    test
    #   ID                                                     pattern smaller7 greater7
    #1:  1         01234567 ~ 01234567 ~ 12345678 ~ 23456789 ~ A012345    FALSE    FALSE
    #2:  2 A012345 ~ 12345678 ~ A01234 ~ 12345678 ~ 23456789 ~ A012345     TRUE    FALSE
    #3:  3         A012345 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789    FALSE    FALSE
    #4:  4                   12345678 ~ 01234567 ~ 12345678 ~ 23456789    FALSE    FALSE
    #5:  5        A0123456 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789    FALSE     TRUE
    

    Also here is the alternative with [A-z] and [A-z0-9].

    # less than 7 digits
    test[, smaller7 := grepl("(^| )[A-z]+([A-z0-9]{1,5})($| )", pattern),]
    # more than 7 digits
    test[, greater7 := grepl("(^| )[A-z]+([A-z0-9]{7,})($| )", pattern),]
    
    

    But from your description what you probably want is "(^| )[[:alpha:]]+([[:alnum:]]{1,6})($| )" instead of "(^| )[[:alpha:]]+([[:alnum:]]{1,5})($| )" to really get something under 7 and on the other hand {8,}. Otherwise you get also lengths which are exactly 7.