
RuntimeError: Unknown device when trying to run AlbertForMaskedLM on colab tpu

I am running the following code on colab taken from the example here:

import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm

assert os.environ['COLAB_TPU_ADDR']

dev = xm.xla_device()

from transformers import AlbertTokenizer, AlbertForMaskedLM

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
model = AlbertForMaskedLM.from_pretrained('albert-base-v2').to(dev)
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1

data = input_ids.to(dev)  # move the inputs onto the TPU device as well

outputs = model(data, masked_lm_labels=data)
loss, prediction_scores = outputs[:2]

I haven't changed the example code except to move input_ids and model onto the TPU device using .to(dev). Everything seems to be moved to the TPU without problems: when I inspect data, I get the following output: tensor([[ 2, 10975, 15, 51, 1952, 25, 10901, 3]], device='xla:1')

However, when I run this code I get the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-5-f756487db8f7> in <module>()
----> 2 outputs = model(data, masked_lm_labels=data)
      3 loss, prediction_scores = outputs[:2]

9 frames
/usr/local/lib/python3.6/dist-packages/transformers/ in forward(self, hidden_states, attention_mask, head_mask)
    277         attention_output = self.attention(hidden_states, attention_mask, head_mask)
    278         ffn_output = self.ffn(attention_output[0])
--> 279         ffn_output = self.activation(ffn_output)
    280         ffn_output = self.ffn_output(ffn_output)
    281         hidden_states = self.full_layer_layer_norm(ffn_output + attention_output[0])

RuntimeError: Unknown device

Anyone know what's going on?


  • Solution is here:

    Before calling model.to(dev), you need to call xm.send_cpu_data_to_device(model, xm.xla_device()):

    model = AlbertForMaskedLM.from_pretrained('albert-base-v2')
    model = xm.send_cpu_data_to_device(model, dev)
    model = model.to(dev)

    There are also some issues with getting the gelu activation function that ALBERT uses to work on the TPU, so you need to use the following branch of transformers when working on TPU:

    See the following colab notebook for the full solution:
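
For background on the gelu issue mentioned above, here is a minimal sketch of what the activation computes, using only the Python standard library. It compares the exact erf-based GELU with the well-known tanh approximation, which needs nothing beyond multiplication, addition and tanh. The function names gelu_exact and gelu_tanh are made up for this illustration, and this is not necessarily what the referenced transformers branch does; it only shows why swapping one formulation for the other is a plausible workaround when an activation op fails on a particular device.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU (hypothetical helper name).

    Built only from multiply, add and tanh, so it avoids erf entirely.
    """
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    # Print both variants side by side for a few sample inputs.
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two variants agree closely over the range that matters for activations, so a model trained with one usually tolerates the other, though that is an assumption worth checking against your own outputs.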