Initially I had code in a for-loop, which ran correctly but slowly. I added code from future.apply
to make it run in parallel. Please see the code below. The variable sub_files contains a character vector of paths to the compressed files on disk.
If I test the function in isolation, it works correctly.
If I run the complete code, it gives this error:
Error in future_apply(sub_files, FUN = process_file, future.seed = TRUE) :
dim(X) must have a positive length
This is the code:
library(tibble)
library(tidyverse)
library(future.apply)
library(jsonlite) # JSON parsing
# Set up parallel processing
plan(multisession)
parent_folder <- "D:/data"
sub_files <- list.files(parent_folder, recursive = TRUE, full.names= TRUE)
mydata_all <- data.frame()
# Function to process each file
process_file <- function(sub_file) {
  print(sub_file)
  # Check if the file exists and is not empty
  if (file.exists(sub_file) && file.info(sub_file)$size > 0) {
    # Read the lines from the file
    lines <- readLines(sub_file)
    # Check if the lines contain valid JSON
    if (all(sapply(lines, function(line) {
      tryCatch({
        fromJSON(line)
        TRUE
      }, error = function(e) {
        FALSE
      })
    }))) {
      # If all lines contain valid JSON, proceed with reading JSON
      out <- lapply(lines, fromJSON)
      song <- out[[1]]$mc$marketDefinition$song
      music_type <- out[[1]]$mc$musicDefinition$name
      mydata <- tibble(song = song, music_type = music_type)
      return(mydata)
    } else {
      # If any line doesn't contain valid JSON, handle the error
      print("Invalid JSON data in the file!!")
      return(NULL)
    }
  } else {
    # If the file doesn't exist or is empty, handle the error
    print("The file doesn't exist or is empty!!")
    return(NULL)
  }
}
# Process files in parallel
results <- future_apply(sub_files, FUN = process_file, future.seed = TRUE)
# Bind rows of non-NULL results
mydata_all <- bind_rows(results[!sapply(results, is.null)])
Any idea what can be the cause of the error? Thanks a lot!
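The error occurs because sub_files has no dimensions: list.files() returns a plain character vector, and future_apply(), like base apply(), expects a matrix or array.
dim(sub_files)  # NULL
You can convert it to a matrix (which has dimensions) and call future_apply() with MARGIN set, which the original call omitted: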
results <- future_apply(X = as.matrix(sub_files),  # argument names are case-sensitive: X, not x
                        FUN = process_file,
                        MARGIN = 1L,  # by rows
                        future.seed = TRUE)
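As an alternative, since sub_files is just a vector, future_lapply() from the same package may be a more natural fit, because it iterates over elements and needs neither MARGIN nor a matrix. Below is a minimal sketch under some assumptions: the file names and process_file_demo() are hypothetical stand-ins for your list.files() result and your process_file(), so that it runs without any files on disk.

```r
library(future.apply)

plan(multisession)

# Hypothetical stand-in for the list.files() result: a plain character vector
sub_files <- c("a.json", "b.json", "c.json")

# A vector has no dimensions, which is what future_apply() complained about
no_dims <- is.null(dim(sub_files))       # TRUE
matrix_dims <- dim(as.matrix(sub_files)) # 3 rows, 1 column

# Hypothetical stand-in for process_file(): one small data frame per path,
# or NULL when a file should be skipped
process_file_demo <- function(sub_file) {
  if (sub_file == "b.json") return(NULL)
  data.frame(file = sub_file, n_chars = nchar(sub_file))
}

# future_lapply() walks over the vector elements in parallel and returns a list
results <- future_lapply(sub_files, process_file_demo, future.seed = TRUE)

# Drop the NULL entries and stack the remaining data frames
mydata_all <- do.call(rbind, Filter(Negate(is.null), results))

plan(sequential)  # shut down the background workers
```

In your script the equivalent call would be future_lapply(sub_files, process_file, future.seed = TRUE), and the existing bind_rows() line should work unchanged on the returned list.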