
Generator function for a Keras LSTM that outputs mini-batches from one file at a time


I have a generator function that works fine. I have a large list of .txt files, and each file is itself quite long. The task now is to write a generator function that takes:

  1. a batch of files
  2. and then a batch of size 128 out of one file

My code so far:

data_files_generator <- function(train_set) {

  files <- train_set
  next_file <- 0

  function() {

    # move to the next file (note the <<- assignment operator)
    next_file <<- next_file + 1

    # if we've exhausted all of the files then start again at the
    # beginning of the list (keras generators need to yield
    # data infinitely -- termination is controlled by the epochs
    # and steps_per_epoch arguments to fit_generator())
    if (next_file > length(files))
    {next_file <<- 1}

    # determine the file name
    file <- files[[next_file]]

    text <- read_lines(paste(data_dir, file, sep = "" )) %>%
      str_to_lower() %>%
      str_c(collapse = "\n") %>%
      removeNumbers() %>%
      tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

    text <- text[text %in% chars]

    dataset <- map(
      seq(1, length(text) - maxlen - 1, by = 3), 
      ~list(sentece = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
    )

    dataset <- transpose(dataset)

    # Vectorization
    x <- array(0, dim = c(length(dataset$sentece), maxlen, length(chars)))
    y <- array(0, dim = c(length(dataset$sentece), length(chars)))

    for(i in 1:length(dataset$sentece)){

      x[i,,] <- sapply(chars, function(x){
        as.integer(x == dataset$sentece[[i]])
      })

      y[i,] <- as.integer(chars == dataset$next_char[[i]])

    }
    # trim the arrays to a whole number of mini-batches
    rounded_dim <- floor(dim(x)[1]/mini_batch_size)
    match_size_to_batch <- mini_batch_size * rounded_dim

    x <- x[1:match_size_to_batch, 1:maxlen, 1:length(chars)]
    y <- y[1:match_size_to_batch, 1:length(chars)]

    return(list(x, y))

  }
}

What comes in is a text file, which is cut into smaller pieces of text (of length maxlen) and then one-hot encoded into 0/1 matrices.

The problem is that my code outputs a single data cube of size samples x maxlen x length(chars), where the number of samples is very large. That is why I would like my generator function to always output a cube of size 128 x maxlen x length(chars), then the next batch of the same size, and so on until the whole text file has been read, and then move on to the next text file.

At the moment the output is an error:

 Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Cannot feed value of shape (112512, 40, 43) for Tensor 'lstm_layer_input_1:0', which has shape '(128, 40, 43)' 

I hope I have explained it well enough to understand. I think I have to add some kind of for loop to iterate over the samples, but I have no idea how to include this in the generator function.
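The slicing the question is asking for can be sketched in base R on a toy array. The dimensions below are made-up values for illustration (not the 112512 x 40 x 43 array from the error message), and the sketch only shows how consecutive row spans of one batch each are cut out of the first dimension; it does not yet turn this into a generator.

```r
# Toy dimensions (assumptions for illustration only)
n_samples <- 10
maxlen <- 4
n_chars <- 3
mini_batch_size <- 4

# stand-in for the one-hot encoded array: samples x maxlen x chars
x <- array(seq_len(n_samples * maxlen * n_chars),
           dim = c(n_samples, maxlen, n_chars))

# number of full batches; leftover samples (here 2) are dropped
n_batches <- floor(dim(x)[1] / mini_batch_size)

# cut the first dimension into consecutive, non-overlapping spans
batches <- lapply(seq_len(n_batches), function(b) {
  span_start <- (b - 1) * mini_batch_size + 1
  span_end <- b * mini_batch_size
  x[span_start:span_end, , , drop = FALSE]
})
```

Each element of batches then has the shape mini_batch_size x maxlen x n_chars, which is the shape the input layer in the error message expects for a batch.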


Solution

  • I now keep the arrays of the current file in the generator's closure and return one batch of size 128 (mini_batch_size) per call; the next file is only loaded once every batch of the current file has been returned. A plain for loop with return() inside the generator does not work, because return() exits on the first iteration, so only the first batch of each file would ever be produced.

    Changed Code:

    data_files_generator <- function(train_set) {

      files <- train_set
      next_file <- 0

      # state kept between calls: the arrays of the current file, the number
      # of full batches they hold, and the index of the next batch to return
      x_file <- NULL
      y_file <- NULL
      n_batches <- 0
      next_batch <- 1

      function() {

        # load a new file only when every batch of the current one has been
        # returned (the loop also moves past a file that yields fewer than
        # mini_batch_size samples)
        while (next_batch > n_batches) {

          # move to the next file (note the <<- assignment operator)
          next_file <<- next_file + 1

          # if we've exhausted all of the files then start again at the
          # beginning of the list (keras generators need to yield
          # data infinitely -- termination is controlled by the epochs
          # and steps_per_epoch arguments to fit_generator())
          if (next_file > length(files))
          {next_file <<- 1}

          # determine the file name
          file <- files[[next_file]]

          text <- read_lines(paste(data_dir, file, sep = "" )) %>%
            str_to_lower() %>%
            str_c(collapse = "\n") %>%
            removeNumbers() %>%
            tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

          text <- text[text %in% chars]

          dataset <- map(
            seq(1, length(text) - maxlen - 1, by = 3),
            ~list(sentece = text[.x:(.x + maxlen - 1)], next_char = text[.x + maxlen])
          )

          dataset <- transpose(dataset)

          # Vectorization
          x <- array(0, dim = c(length(dataset$sentece), maxlen, length(chars)))
          y <- array(0, dim = c(length(dataset$sentece), length(chars)))

          for(i in seq_along(dataset$sentece)){

            x[i,,] <- sapply(chars, function(ch){
              as.integer(ch == dataset$sentece[[i]])
            })

            y[i,] <- as.integer(chars == dataset$next_char[[i]])

          }

          # store the arrays in the closure and restart the batch counter
          x_file <<- x
          y_file <<- y
          n_batches <<- floor(dim(x)[1]/mini_batch_size)
          next_batch <<- 1
        }

        # rows of the current batch: 1:128, then 129:256, ...
        span_start <- (next_batch - 1) * mini_batch_size + 1
        span_end <- next_batch * mini_batch_size
        next_batch <<- next_batch + 1

        list(
          x_file[span_start:span_end, , , drop = FALSE],
          y_file[span_start:span_end, , drop = FALSE]
        )
      }
    }
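To see the call-by-call behaviour of such a stateful generator without reading any files, here is a minimal sketch that uses small in-memory matrices as stand-ins for the encoded files. The name make_batch_generator() and its arguments are hypothetical and exist only for this illustration; the point is just that the closure remembers which "file" and which batch comes next, and wraps around at the end of the list.

```r
# make_batch_generator() is a hypothetical helper for this sketch only.
make_batch_generator <- function(arrays, batch_size) {
  current <- 0      # index of the array ("file") currently being served
  n_batches <- 0    # full batches available in the current array
  next_batch <- 1   # index of the next batch to return

  function() {
    # advance to the next array once the current one is used up,
    # wrapping around to the first array after the last one
    while (next_batch > n_batches) {
      current <<- current %% length(arrays) + 1
      n_batches <<- floor(nrow(arrays[[current]]) / batch_size)
      next_batch <<- 1
    }
    rows <- ((next_batch - 1) * batch_size + 1):(next_batch * batch_size)
    next_batch <<- next_batch + 1
    arrays[[current]][rows, , drop = FALSE]
  }
}

# Two stand-in "files": 5 rows and 4 rows, 2 columns each
arrays <- list(matrix(1:10, nrow = 5), matrix(101:108, nrow = 4))
gen <- make_batch_generator(arrays, batch_size = 2)

b1 <- gen()  # rows 1-2 of the first matrix
b2 <- gen()  # rows 3-4 of the first matrix (row 5 is dropped)
b3 <- gen()  # rows 1-2 of the second matrix
b4 <- gen()  # rows 3-4 of the second matrix
b5 <- gen()  # wraps around: rows 1-2 of the first matrix again
```

With a generator like this, one pass over all files takes as many calls as there are full batches in total (4 in the toy example), so that total is presumably the value to use for steps_per_epoch.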