Tags: python, machine-learning, pytorch, allennlp, elmo

Error training ELMo - RuntimeError: The size of tensor a (5158) must match the size of tensor b (5000) at non-singleton dimension 1


I am trying to train my own custom ELMo model on AllenNLP.

Training fails with RuntimeError: The size of tensor a (5158) must match the size of tensor b (5000) at non-singleton dimension 1. The size reported for tensor a varies between runs (e.g. 5300), while tensor b is always 5000. When I tested on a small subset of files, the model trained successfully.

My intuition is that this is related to the number of tokens per document, specifically files that contain more than 5000 tokens. However, I could not find a parameter in the AllenNLP package that would let me tweak this and bypass the error.

Any advice on how I can overcome this issue? Would tweaking the PyTorch code to cap the size at 5000 work, and if so, how would I do that? Any insights would be deeply appreciated.

FYI, I am currently using a customised DatasetReader for tokenisation. I generated my own vocabulary list before training (to save some time), and it is used to train the ELMo model via AllenNLP.

Update: I found that the error comes from a max_len=5000 default on the positional encoding in AllenNLP (in bidirectional_lm_transformer.py; see the traceback below). I tweaked that parameter to larger values, but on many occasions ended up with a CUDA out-of-memory error instead, which makes me believe it should not be touched.
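For illustration, here is a minimal, self-contained sketch in plain PyTorch (not the actual AllenNLP code) of why the addition shown in the traceback below fails: the positional-encoding table is pre-computed for at most max_len = 5000 positions, so the slice taken in forward never exceeds 5000 rows and cannot be broadcast against a longer sequence.

import torch

# Illustrative shapes only; max_len and d_model mirror the config/defaults above.
max_len, d_model = 5000, 512
positional_encoding = torch.zeros(1, max_len, d_model)   # pre-computed table: (1, 5000, 512)

x = torch.randn(1, 5158, d_model)                         # a batch holding a 5158-token document
x = x + positional_encoding[:, : x.size(1)]               # slice is capped at 5000 rows
# RuntimeError: The size of tensor a (5158) must match the size of tensor b (5000)
# at non-singleton dimension 1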

Environment: Python 3.6.9, Linux Ubuntu, allennlp=2.9.1, allennlp-models=2.9.0

Traceback:

Traceback (most recent call last):
  File "/home/jiayi/.local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 121, in main
    args.func(args)
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/commands/train.py", line 120, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/commands/train.py", line 179, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/commands/train.py", line 246, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/commands/train.py", line 470, in _train_worker
    metrics = train_loop.run()
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/commands/train.py", line 543, in run
    return self.trainer.train()
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/training/gradient_descent_trainer.py", line 720, in train
    metrics, epoch = self._try_train()
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/training/gradient_descent_trainer.py", line 741, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/training/gradient_descent_trainer.py", line 459, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp/training/gradient_descent_trainer.py", line 352, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/home/jiayi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp_models/lm/models/language_model.py", line 257, in forward
    embeddings, mask
  File "/home/jiayi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp_models/lm/modules/seq2seq_encoders/bidirectional_lm_transformer.py", line 282, in forward
    token_embeddings = self._position(token_embeddings)
  File "/home/jiayi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jiayi/.local/lib/python3.6/site-packages/allennlp_models/lm/modules/seq2seq_encoders/bidirectional_lm_transformer.py", line 68, in forward
    return x + self.positional_encoding[:, : x.size(1)]
RuntimeError: The size of tensor a (5385) must match the size of tensor b (5000) at non-singleton dimension 1

AllenNLP training config file:

// For more info on config files generally, see https://guide.allennlp.org/using-config-files

local NUM_GRAD_ACC = 4;
local BATCH_SIZE = 1;

local BASE_LOADER = {
  "max_instances_in_memory": 8,
  "batch_sampler": {
    "type": "bucket",
    "batch_size": BATCH_SIZE,
    "sorting_keys": ["source"]
  }
};

{
    "dataset_reader" : {
        "type": "mimic_reader",
        "token_indexers": {
            "tokens": {
                "type": "single_id"
            },
            "token_characters": {
                "type": "elmo_characters"
            }
        },
        "start_tokens": ["<S>"],
        "end_tokens": ["</S>"],
    },
    "train_data_path": std.extVar("MIMIC3_NOTEEVENTS_DISCHARGE_PATH"),
    // Note: We don't set a validation_data_path because the softmax is only
    // sampled during training. Not sampling on GPUs results in a certain OOM
    // given our large vocabulary. We'll need to evaluate against the test set
    // (when we'll want a full softmax) with the CPU.
    "vocabulary": {
        // Use a prespecified vocabulary for efficiency.
        "type": "from_files",
        "directory": std.extVar("ELMO_VOCAB_PATH"),
        // Plausible config for generating the vocabulary.
        // "tokens_to_add": {
        //     "tokens": ["<S>", "</S>"],
        //     "token_characters": ["<>/S"]
        // },
        // "min_count": {"tokens": 3}
    },
    "model": {
        "type": "language_model",
        "bidirectional": true,
        "num_samples": 8192,
        // Sparse embeddings don't work with DistributedDataParallel.
        "sparse_embeddings": false,
        "text_field_embedder": {
        "token_embedders": {
            "tokens": {
            "type": "empty"
            },
            "token_characters": {
                "type": "character_encoding",
                "embedding": {
                    "num_embeddings": 262,
                    // Same as the Transformer ELMo in Calypso. Matt reports that
                    // this matches the original LSTM ELMo as well.
                    "embedding_dim": 16
                },
                "encoder": {
                    "type": "cnn-highway",
                    "activation": "relu",
                    "embedding_dim": 16,
                    "filters": [
                        [1, 32],
                        [2, 32],
                        [3, 64],
                        [4, 128],
                        [5, 256],
                        [6, 512],
                        [7, 1024]],
                    "num_highway": 2,
                    "projection_dim": 512,
                    "projection_location": "after_highway",
                    "do_layer_norm": true
                }
            }
        }
        },
        // Consider the following.
        // remove_bos_eos: true,
        // Applies to the contextualized embeddings.
        "dropout": 0.1,
        "contextualizer": {
            "type": "bidirectional_language_model_transformer",
            "input_dim": 512,
            "hidden_dim": 4096,
            "num_layers": 2,
            "dropout": 0.1,
            "input_dropout": 0.1
        }
    },
    "data_loader": BASE_LOADER,
    // "distributed": {
    //     "cuda_devices": [0, 1],
    // },
    "trainer": {
        "num_epochs": 10,
        "cuda_devices": [0, 1, 2, 3],
        "optimizer": {
        // The gradient accumulators in Adam for the running stdev and mean for
        // words not used in the sampled softmax would be decayed to zero with the
        // standard "adam" optimizer.
        "type": "dense_sparse_adam"
        },
        // "grad_norm": 10.0,
        "learning_rate_scheduler": {
        "type": "noam",
        // See https://github.com/allenai/calypso/blob/master/calypso/train.py#L401
        "model_size": 512,
        // See https://github.com/allenai/calypso/blob/master/bin/train_transformer_lm1b.py#L51.
        // Adjusted based on our sample size relative to Calypso's.
        "warmup_steps": 6000
        },
        "num_gradient_accumulation_steps": NUM_GRAD_ACC,
        "use_amp": true
    }
}

Solution

  • Setting the max_tokens parameter of the custom DatasetReader to a value below 5000 makes the error go away. One of AllenNLP's contributors suggested the same fix: make sure the reader truncates each input to at most 5000 tokens (a sketch is given below).

    The same question was posted on the AllenNLP discussion board: https://github.com/allenai/allennlp/discussions/5601
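A minimal sketch of how that truncation might look in the custom reader (the class, registered name, and max_tokens parameter here are illustrative assumptions; the real mimic_reader is not shown in the question). With a constructor argument like this, the limit can be set from the dataset_reader block of the training config, e.g. "max_tokens": 4999. If the reader also adds <S> and </S> (as in the config above), truncate before appending them so the total length stays under 5000.

# Hedged sketch, not the actual mimic_reader: a DatasetReader that truncates each
# document to max_tokens so no instance exceeds the 5000-position positional-encoding table.
from typing import Dict, Iterable

from allennlp.data import DatasetReader, Instance, TokenIndexer
from allennlp.data.fields import TextField
from allennlp.data.tokenizers import WhitespaceTokenizer


@DatasetReader.register("truncating_mimic_reader")  # hypothetical registered name
class TruncatingMimicReader(DatasetReader):
    def __init__(
        self,
        token_indexers: Dict[str, TokenIndexer],
        max_tokens: int = 4999,  # keep below the positional-encoding limit of 5000
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self._tokenizer = WhitespaceTokenizer()      # assumption: whitespace tokenisation
        self._token_indexers = token_indexers
        self._max_tokens = max_tokens

    def _read(self, file_path: str) -> Iterable[Instance]:
        # Assumption: one document per line in the training file.
        with open(file_path) as data_file:
            for line in data_file:
                yield self.text_to_instance(line.strip())

    def text_to_instance(self, text: str) -> Instance:  # type: ignore[override]
        tokens = self._tokenizer.tokenize(text)
        if self._max_tokens is not None:
            tokens = tokens[: self._max_tokens]          # truncate overly long notes
        return Instance({"source": TextField(tokens, self._token_indexers)})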