I have a datatable, which looks like this:
test <- data.table(
ID = 1:5,
pattern = c("01234567 ~ 01234567 ~ 12345678 ~ 23456789 ~ A012345",
"A012345 ~ 12345678 ~ A01234 ~ 12345678 ~ 23456789 ~ A012345",
"A012345 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789",
"12345678 ~ 01234567 ~ 12345678 ~ 23456789",
"A0123456 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789")
)
I want to know if the rows have a pattern, starting with a character (A-Z) and have less than 7 or more than 7 digits. For this, I want to add two new columns, named smaller7
and greater7
, which say TRUE or FALSE.
I tried with this:
# less than 7 digits
test[, smaller7 := grepl("(^| )[A-Z].{1,5}($| )", test$pattern, ignore.case = TRUE)]
# more than 7 digits
test[, greater7 := grepl("(^| )[A-Z].{7,}($| )$", test$pattern, ignore.case = TRUE)]
The result is:
ID | pattern | smaller7 | greater7 |
---|---|---|---|
1 | 01234567 ~ 01234567 ~ 12345678 ~ 23456789 ~ A012345 | FALSE | FALSE |
2 | A012345 ~ 12345678 ~ A01234 ~ 12345678 ~ 23456789 ~ A012345 | TRUE | TRUE |
3 | A012345 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 | FALSE | TRUE |
4 | 12345678 ~ 01234567 ~ 12345678 ~ 23456789 | FALSE | FALSE |
5 | A0123456 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 | FALSE | TRUE |
But the right output should be this:
ID | pattern | smaller7 | greater7 |
---|---|---|---|
1 | 01234567 ~ 01234567 ~ 12345678 ~ 23456789 ~ A012345 | FALSE | FALSE |
2 | A012345 ~ 12345678 ~ A01234 ~ 12345678 ~ 23456789 ~ A012345 | TRUE | FALSE |
3 | A012345 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 | FALSE | FALSE |
4 | 12345678 ~ 01234567 ~ 12345678 ~ 23456789 | FALSE | FALSE |
5 | A0123456 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 | FALSE | TRUE |
Whats wrong with my grepl? I need a solution without splitting column. Thanks for your help!
I think you are just missing a few brackets here and there. I also tried to be a bit more explicit with the digits.
# less than 7 digits
test[, smaller7 := grepl("(^| )[[:alpha:]]+([[:alnum:]]{1,5})($| )", pattern),]
# more than 7 digits
test[, greater7 := grepl("(^| )[[:alpha:]]+([[:alnum:]]{7,})($| )", pattern),]
test
# ID pattern smaller7 greater7
#1: 1 01234567 ~ 01234567 ~ 12345678 ~ 23456789 ~ A012345 FALSE FALSE
#2: 2 A012345 ~ 12345678 ~ A01234 ~ 12345678 ~ 23456789 ~ A012345 TRUE FALSE
#3: 3 A012345 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 FALSE FALSE
#4: 4 12345678 ~ 01234567 ~ 12345678 ~ 23456789 FALSE FALSE
#5: 5 A0123456 ~ 12345678 ~ 01234567 ~ 12345678 ~ 23456789 FALSE TRUE
Also here is the alternative with [A-z]
and [A-z0-9]
.
# less than 7 digits
test[, smaller7 := grepl("(^| )[A-z]+([A-z0-9]{1,5})($| )", pattern),]
# more than 7 digits
test[, greater7 := grepl("(^| )[A-z]+([A-z0-9]{7,})($| )", pattern),]
But from your description what you probably want is "(^| )[[:alpha:]]+([[:alnum:]]{1,6})($| )"
instead of "(^| )[[:alpha:]]+([[:alnum:]]{1,5})($| )"
to really get something under 7 and on the other hand {8,}
.
Otherwise you get also lengths which are exactly 7.