Error in future_apply() : dim(X) must have a positive length


Initially I had this code in a for loop, which ran correctly but slowly. I added code from future.apply to make it run in parallel. Please see the code below. The variable sub_files contains a list of paths to the compressed files on disk.

If I test the function in isolation, it works correctly.

If I run the complete code, I get this error:

Error in future_apply(sub_files, FUN = process_file, future.seed = TRUE) : 
  dim(X) must have a positive length

This is the code:

library(tibble)
library(tidyverse)
library(future.apply)
library(jsonlite) # JSON parsing

# Set up parallel processing
plan(multisession)

parent_folder <- "D:/data"

sub_files <- list.files(parent_folder, recursive = TRUE, full.names= TRUE)

mydata_all <- data.frame()

# Function to process each file
process_file <- function(sub_file) {
  print(sub_file)
  
  # Check if the file exists and is not empty
  if (file.exists(sub_file) && file.info(sub_file)$size > 0) {
    # Read the lines from the file
    lines <- readLines(sub_file)
    
    # Check if the lines contain valid JSON
    if (all(sapply(lines, function(line) {
      tryCatch({
        fromJSON(line)
        TRUE
      }, error = function(e) {
        FALSE
      })
    }))) {
      # If all lines contain valid JSON, proceed with reading JSON
      out <- lapply(lines, fromJSON)
      song <- out[[1]]$mc$marketDefinition$song
      music_type <- out[[1]]$mc$musicDefinition$name
      mydata <- tibble(song = song, music_type = music_type)
      
      return(mydata)
      
    } else {
      # If any line doesn't contain valid JSON, handle the error
      print("Invalid JSON data in the file!!")
      return(NULL)
    }
  } else {
    # If the file doesn't exist or is empty, handle the error
    print("The file doesn't exist or is empty!!")
    return(NULL)
  }
}

# Process files in parallel
results <- future_apply(sub_files, FUN = process_file, future.seed = TRUE)

# Bind rows of non-NULL results
mydata_all <- bind_rows(results[!sapply(results, is.null)])

Any idea what could be causing the error? Thanks a lot!


Solution

  • The error occurs because sub_files is a plain character vector, so dim(sub_files) is NULL, while future_apply() (like base apply()) requires an object with dimensions. You can convert sub_files to a one-column matrix and call future_apply() with MARGIN specified:

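    # Note: the argument is X (uppercase); R argument names are case-sensitive
    results <- future_apply(X = as.matrix(sub_files),
                            MARGIN = 1L, # by rows: one file path per call
                            FUN = process_file,
                            future.seed = TRUE)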
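
As a possible alternative, here is a minimal sketch that uses future_lapply() from the same package, which iterates over a vector directly and needs no matrix conversion. The file names and the simplified process_file() below are hypothetical stand-ins for the real ones in the question, and the worker count of 2 is an arbitrary assumption.

```r
library(future.apply)  # also attaches the future package, which provides plan()

plan(multisession, workers = 2)  # assumption: two background R sessions

# Hypothetical stand-ins for the real paths and for process_file()
sub_files <- c("a.json", "b.json", "c.json")
process_file <- function(sub_file) {
  data.frame(file = sub_file, n_chars = nchar(sub_file))
}

# A character vector has no dim attribute, which is what future_apply() rejects
is.null(dim(sub_files))

# as.matrix() gives it dimensions: one row per path, a single column
dim(as.matrix(sub_files))

# future_lapply() walks the vector element by element and returns a list
results <- future_lapply(sub_files, process_file, future.seed = TRUE)

# Combine the per-file data frames into one
mydata_all <- do.call(rbind, results)

plan(sequential)  # shut down the background workers
```

Because future_lapply() always returns a list, the existing bind_rows(results[!sapply(results, is.null)]) step from the question should work on its output unchanged.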