Convert a column containing unnamed and unordered attributes strings into a column per attribute using rrapply

I have a dataframe column consisting of a string of attributes in the format "attribute1=>value1,attribute2=>value2,...". Entries have different attributes in different orders. For example, row 1 may have attributes A, B, and C while row 2 may have attributes C, B, D, and E.

I want to filter the attribute list by string match and then expand this column into multiple columns where each column name is the attribute name. Missing values should be populated by NA.

I have a solution using lots of lapply statements but I expect there is a more efficient one, either through simplifying the current solution, or using rrapply or another function/package (such as separate_wider_delim). The dataframe will have ~500,000 rows.

Sample data

my_ids = c(1066,1990,578,2299,725,219,1229,2089,383,544,1243,1164,723)

my_attributes = c("addr:city=>Shanklin,addr:country=>GB,addr:housenumber=>36,38,addr:postcode=>PO37 6JY,addr:street=>High Street,amenity=>restaurant,level=>0",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
               "addr:city=>Ryde,addr:housenumber=>78,addr:street=>West Street,amenity=>doctors,healthcare=>doctor",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
               "addr:city=>Ryde,addr:country=>GB,addr:housenumber=>134,addr:postcode=>PO33 2RJ,addr:street=>High Street,amenity=>community_centre,check_date:opening_hours=>2023-09-16,fhrs:id=>1647467,level=>0,opening_hours:signed=>no",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
               "addr:city=>Ventnor,addr:country=>GB,addr:housename=>Froghill Farm,addr:postcode=>PO38 3AN,addr:street=>Shanklin Road,addr:village=>Godshill,amenity=>fuel,brand=>Gulf,brand:wikidata=>Q5617505,car_wash=>no,fhrs:id=>540417",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
               "addr:city=>Freshwater,addr:country=>GB,addr:housename=>The Apple Farm,addr:postcode=>PO40 9XR,addr:street=>Newport Road,amenity=>cafe,fhrs:id=>540510,source:addr=>FHRS Open Data",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
               "access=>yes,addr:city=>Ventnor,addr:postcode=>PO38 1UF,addr:street=>Steephill Road,amenity=>parking,capacity=>10,charge=>£3 per day,contact:email=>steephillcarpark@gmail.com,fee=>yes,park_ride=>no,parking=>surface,payment:coins=>yes,surface=>unpaved",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
               "addr:city=>Ventnor,addr:country=>GB,addr:postcode=>PO38 1UF,addr:street=>Steephill Road,amenity=>pub,fhrs:id=>541580,source:addr=>FHRS Open Data,wheelchair=>yes",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
               "addr:city=>Ventnor,addr:country=>GB,addr:housename=>Ventnor Park,addr:postcode=>PO38 1JZ,addr:street=>Park Avenue,amenity=>cafe,drinking_water:refill=>yes,drinking_water:refill:network=>Refill,fhrs:id=>540763",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
               "addr:city=>Ventnor,addr:country=>GB,addr:postcode=>PO38 3LY,addr:street=>Newport Road,addr:village=>Godshill,fhrs:id=>541337,shop=>farm",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
               "addr:city=>Newport,addr:country=>GB,addr:postcode=>PO30 2PD,addr:street=>Briddlesford Road,amenity=>recycling,opening_hours=>Nov-Mar: 10:00-18:00; Apr-Oct: 7:00-20:00; Dec 25 off,owner=>Isle of Wight Council,ownership=>municipal,recycling:batteries=>yes,recycling:bulky_waste=>yes,recycling:cans=>yes,recycling:car_batteries=>yes,recycling:cardboard=>yes,recycling:clothes=>yes,recycling:cooking_oil=>yes,recycling:electrical_appliances=>yes,recycling:electrical_items=>yes,recycling:engine_oil=>yes,recycling:fluorescent_tubes=>yes,recycling:glass=>yes,recycling:green_waste=>yes,recycling:paper=>yes,recycling:plastic=>yes,recycling:plastic_bottles=>yes,recycling:residual_waste=>yes,recycling:scrap_metal=>yes,recycling:small_appliances=>yes,recycling:white_goods=>yes,recycling:wood=>yes,recycling_type=>centre,ref:GB:uprn=>10024247154,website=>https://www.iow.gov.uk/Residents/Environment-Planning-and-Waste/Waste-and-Recycling/Recycling-Centres/Lynnbottom-HWRC,wikidata=>Q106093911",
               "addr:city=>Carisbrooke,addr:country=>GB,addr:housename=>Carisbrooke Castle,addr:postcode=>PO30 1XY,addr:street=>Castle Hill,amenity=>cafe,bar=>no,check_date=>2025-05-04,cuisine=>coffee_shop,diet:vegetarian=>yes,fhrs:id=>540472,outdoor_seating=>yes,payment:cash=>yes,payment:credit_cards=>yes,payment:debit_cards=>yes,takeaway=>yes",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
               "addr:city=>Gatcombe,addr:country=>GB,addr:housename=>Little Gatcombe Farm,addr:postcode=>PO30 3EQ,addr:street=>Newbarn Lane,amenity=>cafe,fhrs:id=>540818,operator=>Little Gatcombe Farm",                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
               "addr:city=>Freshwater,addr:country=>GB,addr:county=>Isle of Wight,addr:postcode=>PO40 9DT,addr:street=>Queens Road,amenity=>doctors,healthcare=>doctor,healthcare:speciality=>general,opening_hours=>Mo-Fr 08:30-18:00,operator=>West Wight Medical Partnerships,website=>https://brooksidehealthcentre.nhs.uk/"   
  )

datadf = data.frame(id = my_ids,
                    attribute_tags = my_attributes)

Desired output:

id	addr:city	addr:country	addr:housenumber	addr:postcode	addr:street	addr:housename	addr:village	addr:county
1066	Shanklin	GB	36	PO37 6JY	High Street	NA	NA	NA
1990	Ryde	NA	78	NA	West Street	NA	NA	NA
578	Ryde	GB	134	PO33 2RJ	High Street	NA	NA	NA
2299	Ventnor	GB	NA	PO38 3AN	Shanklin Road	Froghill Farm	Godshill	NA
725	Freshwater	GB	NA	PO40 9XR	Newport Road	The Apple Farm	NA	NA
219	Ventnor	NA	NA	PO38 1UF	Steephill Road	NA	NA	NA
1229	Ventnor	GB	NA	PO38 1UF	Steephill Road	NA	NA	NA
2089	Ventnor	GB	NA	PO38 1JZ	Park Avenue	Ventnor Park	NA	NA
383	Ventnor	GB	NA	PO38 3LY	Newport Road	NA	Godshill	NA
544	Newport	GB	NA	PO30 2PD	Briddlesford Road	NA	NA	NA
1243	Carisbrooke	GB	NA	PO30 1XY	Castle Hill	Carisbrooke Castle	NA	NA
1164	Gatcombe	GB	NA	PO30 3EQ	Newbarn Lane	Little Gatcombe Farm	NA	NA
723	Freshwater	GB	NA	PO40 9DT	Queens Road	NA	NA	Isle of Wight

Current solution:

datadf = data.frame(
  id = sample(3000, 13),
  attributes = a1[57:69])

tags_sep = function(myname) {  # turn into named list
  mystr = strsplit(myname, '=>')
  mylist = mystr[[1]][2]
  names(mylist) = mystr[[1]][1]
  return(mylist)
}

tmp1 = rapply(as.list(datadf$attribute_tags), function(x) strsplit(x, ','), how = 'list')
tmp2 = rapply(tmp1, function(x) subset(x, grepl('addr:', x)), how = 'list')  # get attributes with 'addr' in the name
tmp3 = unlist(tmp2, recursive = F)
tmp4 = lapply(tmp3, function(x) (lapply(x, function(y) tags_sep(y))))  # create named list
tmp5 = lapply(tmp4, function(x) unlist(x, recursive = F))
tmp6 = bind_rows(tmp5)

outputdf = cbind(datadf['id'], tmp6)

Attempt using `rrapply`

library(rrapply)
tmp1 = lapply(as.list(datadf$attribute_tags), function(x) as.list(strsplit(x, ',')[[1]]))  # get into list of lists
tmp2 = rrapply(tmp1,
             condition = function(x)  grepl('addr:', x),
             how = "prune")
tmp3 = rrapply(tmp2,
             f = function(x) tags_sep(x),
             how = 'melt')

Unfortunately, the attribute names are not present in the dataframe. A similar problem arises when using how = 'bind'. The resulting dataframe does not have attribute names as column names and does not properly handle missing attributes or attributes in a different order.

I have seen questions and vignettes and how to handle these issues but these are always starting with a named list. I am trying to create a named list and then a dataframe.

Solution

In base R you could use some regex to accomplish this. The goal of regex is to make the string to a compatible dcf file

a <- gsub(",?([^ ,]+=>)", "\n\\1", datadf$attribute_tags)
b <- gsub("(.*?):(?=.*=>)", "\\1_", a, perl = TRUE)
d <- read.dcf(textConnection(gsub("=>", ":", b)), all = TRUE)
cbind(datadf[1], subset(d, select = startsWith(names(d), "addr")))

     id   addr_city addr_country addr_housenumber addr_postcode       addr_street       addr_housename addr_village   addr_county
1  1066    Shanklin           GB            36,38      PO37 6JY       High Street                 <NA>         <NA>          <NA>
2  1990        Ryde         <NA>               78          <NA>       West Street                 <NA>         <NA>          <NA>
3   578        Ryde           GB              134      PO33 2RJ       High Street                 <NA>         <NA>          <NA>
4  2299     Ventnor           GB             <NA>      PO38 3AN     Shanklin Road        Froghill Farm     Godshill          <NA>
5   725  Freshwater           GB             <NA>      PO40 9XR      Newport Road       The Apple Farm         <NA>          <NA>
6   219     Ventnor         <NA>             <NA>      PO38 1UF    Steephill Road                 <NA>         <NA>          <NA>
7  1229     Ventnor           GB             <NA>      PO38 1UF    Steephill Road                 <NA>         <NA>          <NA>
8  2089     Ventnor           GB             <NA>      PO38 1JZ       Park Avenue         Ventnor Park         <NA>          <NA>
9   383     Ventnor           GB             <NA>      PO38 3LY      Newport Road                 <NA>     Godshill          <NA>
10  544     Newport           GB             <NA>      PO30 2PD Briddlesford Road                 <NA>         <NA>          <NA>
11 1243 Carisbrooke           GB             <NA>      PO30 1XY       Castle Hill   Carisbrooke Castle         <NA>          <NA>
12 1164    Gatcombe           GB             <NA>      PO30 3EQ      Newbarn Lane Little Gatcombe Farm         <NA>          <NA>
13  723  Freshwater           GB             <NA>      PO40 9DT       Queens Road                 <NA>         <NA> Isle of Wight

If speed is necessary, then only read the required fields:

field <- c("city", "country", "housenumber", "postcode", "street", "housename", 
            "village", "county")

fields <- paste0("addr_", field)

a <- gsub(",?([^ ,]+=>)", "\n\\1", datadf$attribute_tags)
b <- gsub("(.*?):(?=.*=>)", "\\1_", a, perl = TRUE)
d <- read.dcf(textConnection(gsub("=>", ":", b)), fields)
cbind(datadf[1], d)

Convert a column containing unnamed and unordered attributes strings into a column per attribute using rrapply

Sample data

Desired output:

Current solution:

Attempt using rrapply

Attempt using `rrapply`