r json rbind

Reading JSON files in R and creating a dataset


I am trying to create a dataset in R from about 3,000 JSON files. (This is the data: Nauman, F. (2023). Clothing Dataset for Second-Hand Fashion (Version 1) [Data set]. Available at: Zenodo. https://doi.org/10.5281/zenodo.8386668)

My goal is to have the data in R as a dataset/table so I can clean it and run some regressions. This is for a school paper, and I'm quite new to R.

Here is my code:

### Reading JSON files

library(rjson)

# Instantiate the data object to hold the JSON details
master_data <- NULL

# Gather the names of the JSON files held in a folder
file_list <- list.files(path = "data/json_files")

# Loop through the list of files, read the JSON details, 
#   convert to data frame and append to the data object
for (i in 1:length(file_list)){
  file_details <- fromJSON(file = paste0("data/json_files/",file_list[i]))
  master_data <- rbind(master_data, as.data.frame(file_details))
}

# Check the data object
master_data

#This is the error I'm getting:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  the arguments imply different numbers of rows : 1, 0

#I tried bind_rows()  to bind uneven rows

for (i in 1:length(file_list)){
  
  file_details <- fromJSON(file = paste0("data/json_files/",file_list[i]))
  
  master_data <- bind_rows(master_data, as.data.frame(file_details))
  
}
#that doesn't work either, same error

#I tried rbind.fill() from the package plyr, 

for (i in 1:length(file_list)){  
  file_details <- fromJSON(file = paste0("data/json_files/",file_list[i]))
  
  master_data <- rbind.fill(master_data, as.data.frame(file_details))
  
}

#that doesn't work either, same error

Any ideas will be appreciated. Thank you!


Solution

  • I used the CRAN package rjsoncons.

    I listed the full paths to the files, so I didn't need to construct them by pasting each file name onto a base directory.

    file_list <- list.files(
        "~/tmp/circular_fashion",
        pattern = "\\.json$",
        recursive = TRUE,
        full.names = TRUE
    )
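
    Note that pattern is a regular expression, not a shell glob, so a loose pattern matches any file name that merely contains "json". A minimal sketch of the difference, using a throwaway temporary directory and made-up file names (both are assumptions for illustration, not part of the dataset):

    ```r
    # Build a scratch directory with one JSON file and one look-alike (hypothetical names)
    demo_dir <- tempfile("json_demo_")
    dir.create(file.path(demo_dir, "sub"), recursive = TRUE)
    invisible(file.create(
        file.path(demo_dir, "sub", "labels.json"),
        file.path(demo_dir, "json_notes.txt")
    ))

    # Loose regex: matches any file name containing "json"
    loose <- list.files(demo_dir, pattern = ".*json", recursive = TRUE, full.names = TRUE)

    # Anchored regex: matches only file names ending in ".json"
    strict <- list.files(demo_dir, pattern = "\\.json$", recursive = TRUE, full.names = TRUE)

    basename(loose)
    basename(strict)
    ```

    With full.names = TRUE, each returned path already includes the directory, so it can be passed straight to a reader.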
    

    I noticed that one of the files is not valid JSON. I found this by iterating over the files and simply trying to read each one with j_query(). If reading failed, I printed the file name and the error message, and recorded NA as the content.

    json <- vapply(file_list, function(file) {
        tryCatch({
            rjsoncons::j_query(file)
        }, error = function(e) {
            message(file, ": ", conditionMessage(e))
            NA_character_
        })
    }, character(1))
    
    

    The output is

    oct2022/2022-10-17/labels_2022_10_17_07_40_32.json: Extra comma at line 12 and column 6
    

    and that JSON file is indeed malformed, with a trailing comma after the only element of the array:

    ...
        "colors": [
            "red",
            
            
          
        
        ],
    ...
    

    I removed the entry for the corrupt file from the JSON strings that I had read in

    json <- json[!is.na(json)]
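
    Because vapply() uses a character input as names by default, the file names travel with the result, so the failed files can be listed before they are dropped. A small sketch with stand-in values (the two file names and the JSON string are made up):

    ```r
    # Stand-in for the result of the vapply() call: names are file paths,
    # values are JSON strings, and NA marks a file that failed to parse
    json <- c(
        "a.json" = '{"brand": "Everest"}',
        "b.json" = NA_character_
    )

    # Names of the files that could not be read
    failed <- names(json)[is.na(json)]

    # Keep only the successfully read JSON strings
    json <- json[!is.na(json)]
    ```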
    

    and then made the R character vector of JSON objects into an array-of-objects

    json_array <- paste0("[", paste(json, collapse = ","), "]")
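
    To make the string manipulation concrete, here is the same paste0() step applied to two tiny stand-in objects (made-up values). The result is a single string holding one JSON array:

    ```r
    # Two stand-in JSON objects, one per (hypothetical) file
    json <- c('{"size": "104"}', '{"size": "M"}')

    # Join the objects with commas and wrap them in square brackets
    json_array <- paste0("[", paste(json, collapse = ","), "]")

    cat(json_array, "\n")
    # [{"size": "104"},{"size": "M"}]
    ```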
    

    Finally, I used j_pivot() to turn the array-of-objects into an R tibble

    tbl <- rjsoncons::j_pivot(json_array, as = "tibble")
    

    Here's the result:

    > tbl
    # A tibble: 3,052 × 22
       brand     brandtext category type  size  colors season pilling condition price
       <chr>     <list>    <chr>    <chr> <chr> <list> <chr>    <int>     <int> <chr>
     1 Everest   <chr [1]> Children Wint… "104" <chr>  Winter       3         3 50-1…
     2 Everest   <chr [1]> Children Wint… "104" <chr>  Winter       3         3 50-1…
     3 Not in t… <chr [1]> Men      Jack… "M "  <chr>  Autumn       5         5 >400 
     4 Everest   <chr [1]> Children Jack… "146" <chr>  Autumn       5         5 100-…
     5 Etirel    <chr [1]> Ladies   Wint… "40"  <chr>  Winter       4         3 100-…
     6 Lindex    <chr [1]> Children Rain… "98"  <chr>  Spring       4         2 <50  
     7 Not in t… <chr [1]> Ladies   Dress "42"  <chr>  Autumn       5         5 100-…
     8 Not in t… <chr [1]> Men      Trou… "Non… <chr>  Autumn       5         4 100-…
     9 Not in t… <chr [1]> Ladies   Blou… "42"  <chr>  Spring       5         5 50-1…
    10 Park Lane <chr [1]> Men      Swea… "M "  <chr>  Autumn       5         5 100-…
    # ℹ 3,042 more rows
    # ℹ 12 more variables: annotator <list>, cut <list>, pattern <chr>, trend <chr>,
    #   smell <list>, stains <chr>, holes <list>, damage <chr>, material <chr>,
    #   comment <chr>, usage <chr>, weight <list>
    # ℹ Use `print(n = ...)` to see more rows
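
    The <list> columns above also suggest why the loop in the question failed: fields such as colors are JSON arrays, and an empty array presumably reaches as.data.frame() as a zero-length element, which cannot be recycled against the length-one fields. The sketch below reproduces that with a made-up record, and then shows one possible way to flatten a list column into plain strings before modelling. The record, the helper name collapse_list_col, and the ";" separator are all assumptions for illustration.

    ```r
    # A made-up record shaped like one parsed file: scalar fields plus an empty array
    record <- list(brand = "Everest", size = "104", colors = list())

    # as.data.frame() cannot combine length-1 and length-0 columns, so it errors
    result <- tryCatch(as.data.frame(record), error = function(e) e)
    inherits(result, "error")

    # lengths() shows which fields are not of length one
    lengths(record)

    # Hypothetical helper: collapse each list element to one string, NA if empty
    collapse_list_col <- function(col, sep = ";") {
        vapply(col, function(v) {
            v <- unlist(v)
            if (length(v) == 0) NA_character_ else paste(v, collapse = sep)
        }, character(1))
    }

    colors <- list(c("red", "blue"), list(), "green")
    flat <- collapse_list_col(colors)
    flat
    ```

    On a tibble like tbl, the helper could be applied column by column, for example tbl$colors <- collapse_list_col(tbl$colors), though whether a collapsed string is the right encoding depends on the regression you have in mind.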