
REGEX to split address components using R - numbers in street/avenue names


Consider the following set of addresses to be split, contained in a variable called addresses:

RUA DOS PASSOS 1523 CS 2

AV CAXIAS, 02 CASA 05

AV 11 DE NOVEMBRO 2032 CASA 4

RUA 05 DE OUTUBRO, 25 CASA 02

The final output should be a table like so:

logradouro                      no_logradouro      nu_logradouro  complemento
RUA DOS PASSOS 1523 CS 2        RUA DOS PASSOS     1523           CS 2
AV CAXIAS, 02 CASA 05           AV CAXIAS          02             CASA 05
AV 11 DE NOVEMBRO 2032 CASA 4   AV 11 DE NOVEMBRO  2032           CASA 4
RUA 05 DE OUTUBRO, 25 CASA 02   RUA 05 DE OUTUBRO  25             CASA 02

Some important points:

  1. Commas in the addresses are optional; some addresses have them, some don't. The final table should contain no commas anywhere;
  2. Numbers may have leading zeroes (e.g. 02, 05), although they are not mandatory.

My current attempt is as follows:

library(stringr)  # provides str_match() and str_squish()

addresses <- c("RUA DOS PASSOS 1523 CS 2", "AV CAXIAS, 02 CASA 05", "AV 11 DE NOVEMBRO 2032 CASA 4", "RUA 05 DE OUTUBRO, 25 CASA 02")
regex <- "^(\\D+(?:(?!\\s\\d+\\s\\D).)*?)\\s(\\d+)(?:\\s+(.*))?$"
# Column 1 of the match matrix is the full match; columns 2-4 are the capture groups
result <- str_match(addresses, regex)
df_result <- data.frame(logradouro = addresses,
                        no_logradouro = ifelse(is.na(result[, 2]), addresses, str_squish(result[, 2])),
                        nu_logradouro = ifelse(is.na(result[, 3]), "", str_squish(result[, 3])),
                        complemento = ifelse(is.na(result[, 4]), "", str_squish(result[, 4])))
  

and the output is:

logradouro                      no_logradouro           nu_logradouro  complemento
RUA DOS PASSOS 1523 CS 2        RUA DOS PASSOS 1523 CS  2
AV CAXIAS, 02 CASA 05           AV CAXIAS, 02 CASA      05
AV 11 DE NOVEMBRO 2032 CASA 4   AV 11 DE NOVEMBRO       2032           CASA 4
RUA 05 DE OUTUBRO, 25 CASA 02   RUA 05 DE OUTUBRO,      25             CASA 02

As you can see, the REGEX works for the composite cases (i.e., "AV 11 DE NOVEMBRO 2032 CASA 4") but fails for the others. How can I adapt my REGEX to work in both cases, taking into account the points listed above?

In summary, each address should be split into 3 parts, if possible: no_logradouro, which is the name of the street, avenue, etc.; nu_logradouro, which is the number of the address; and complemento, which is everything else.


Solution

  • If you assume that the complemento values always start with CS or CASA, you can anchor the pattern at the end of the string and work backwards. Something like this will work:

    x <- c("RUA DOS PASSOS 1523 CS 2", "AV CAXIAS, 02 CASA 05", 
      "AV 11 DE NOVEMBRO 2032 CASA 4", "RUA 05 DE OUTUBRO, 25 CASA 02")
    
    strcapture(
      "(.*?),?\\s(\\d+)\\s+((?:CASA|CS) \\d+)$", 
      x, 
      proto = data.frame(no_logradouro=character(), nu_logradouro=character(), complemento=character())
    )
    

    and returns

          no_logradouro nu_logradouro complemento
    1    RUA DOS PASSOS          1523        CS 2
    2         AV CAXIAS            02     CASA 05
    3 AV 11 DE NOVEMBRO          2032      CASA 4
    4 RUA 05 DE OUTUBRO            25     CASA 02
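
The result above lacks the original logradouro column that the question asks for. As a minimal sketch, assuming the same CASA/CS pattern, the captured parts can be bound back to the input vector. The names pattern, parts and unmatched are only illustrative and do not appear in the original post.

```r
# Same sample addresses as in the question.
addresses <- c("RUA DOS PASSOS 1523 CS 2", "AV CAXIAS, 02 CASA 05",
               "AV 11 DE NOVEMBRO 2032 CASA 4", "RUA 05 DE OUTUBRO, 25 CASA 02")

# Pattern from the answer above; it assumes the complemento starts with CASA or CS.
pattern <- "(.*?),?\\s(\\d+)\\s+((?:CASA|CS) \\d+)$"

# strcapture() ships with base R (package utils), so no extra library is needed.
parts <- strcapture(
  pattern,
  addresses,
  proto = data.frame(no_logradouro = character(),
                     nu_logradouro = character(),
                     complemento   = character())
)

# Prepend the original address so the table has the four requested columns.
df_result <- data.frame(logradouro = addresses, parts, stringsAsFactors = FALSE)

# Illustrative check: addresses that do not match the pattern come back as NA
# in the captured columns, so they can be collected for manual review.
unmatched <- addresses[is.na(parts$no_logradouro)]

print(df_result)
print(unmatched)
```

Any address whose complemento does not start with CASA or CS will end up in unmatched rather than being split incorrectly, which makes the limits of the assumption easy to spot on real data.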