rjsongoogle-bigquerybigrquery

load bigquery JSON data dump into R tibble


I have downloaded a JSON extract from Big Query which has nested and repeated fields (similar to the package bigrquery) and am attempting to further manipulate the resulting tibble.

I have the following code to load from JSON and convert to a tibble

library(tidyverse)
ga.list <- lapply(readLines("temp.json"), jsonlite::fromJSON, flatten = TRUE)
ga.df <- tibble(dat = ga.list) %>% 
    unnest_wider(dat) %>% 
    mutate(id = row_number()) %>%
         unnest_wider(b_nested) %>%
         unnest_wider(b3) %>% 
         unnest_wider(b33)

So there were two list columns:

  1. b_nested, this column is a nested list (which I unnested recursively .. maybe there is a more automated way, if so, please advise!)
  2. rr1 and rr2, these columns will always have the same number of elements. So elements 1 of rr1 and rr2 should be read together.

I am still working out how to extract id, rr1 and rr2 and make into a long table with repeated rows for each id row.

Note: this question has been edited a few times as I progress further along .. originally I had got stuck getting it from JSON to a tibble until I found unnest_wider()

temp.json:

{"a":"4000","b_nested":{"b1":"(not set)","b2":"some - text","b3":{"b31":"1591558980","b32":"60259425255","b33":{"b3311":"133997175"},"b4":false},"b5":true},"rr1":[],"rr2":[]} {"a":"4000","b_nested":{"b1":"asdfasdfa","b2":"some - text more","b3":{"b31":"11111","b32":"2222","b33":{"b3311":"3333333"},"b4":true},"b5":true}, "rr1":["v1","v2","v3"],"rr2":["x1","x2","x3"]} {"a":"6000","b_nested":{"b1":"asdfasdfa","b2":"some - text more","b3":{"b31":"11111","b32":"2222","b33":{"b3311":"3333333"},"b4":true},"b5":true},"rr1":["v1","v2","v3","v4","v5"],"rr2":["aja1","aja2","aja3","aja14","aja5"]}


Solution

  • The final piece of the puzzle; in order to get the repeating rows for repeating record

      ga.df %>% select(id, rr1, rr2)  %>%
      unnest(cols = c(rr1, rr2))
    

    FYI: Link to Big Query Specifying nested and repeated columns

    Another solution (my preference) would be to create a tibble from rr1 and rr1 and keep as a column in ga.df so that purrr functions can be used

        ga.df %>% 
      mutate(rr = map2(rr1, rr2, function(x,y) {
        tibble(rr1 = x, rr2 = y)
      })) %>%
      select(-rr1, -rr2) %>%
      mutate(rr_length = map_int(rr, ~nrow(.x)))