I am attempting to make a Gradio demo for nanoLLaVA by @stablequan, porting over just the structure of the Apache-2.0-licensed code in the Moondream repo.
The nanoLLaVA repo ships example code, which I used to write this script. That works standalone and gives a reasonable output. But when I run the same code from Gradio here, I get an error about a mismatch in devices.
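For context, the Gradio side of my script is essentially the following. The prompt and image handling is reconstructed from the nanoLLaVA example code, so treat the exact details (model.process_images, the -200 placeholder, the generate kwargs) as an approximation rather than my verbatim script:

import torch
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device('cuda')  # as in the nanoLLaVA example script

model = AutoModelForCausalLM.from_pretrained(
    'qnguyen3/nanoLLaVA',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('qnguyen3/nanoLLaVA', trust_remote_code=True)

def answer_question(image, question):
    # same steps as the example script, but now running on a Gradio worker thread
    messages = [{"role": "user", "content": f'<image>\n{question}'}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    # -200 is the image-token placeholder the example code splices in
    input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)
    image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=2048, use_cache=True)[0]
    return tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()

demo = gr.Interface(fn=answer_question,
                    inputs=[gr.Image(type='pil'), gr.Textbox(label='Question')],
                    outputs='text')
demo.launch()

Submitting an image and a question then produces this traceback: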
Traceback (most recent call last):
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\queueing.py", line 495, in call_prediction
output = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\route_utils.py", line 232, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\blocks.py", line 1561, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\blocks.py", line 1179, in call_function
prediction = await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\utils.py", line 678, in wrapper
response = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\Downloads\llm\nanollava\nanollava_gradio_demo.py", line 46, in answer_question
output_ids = model.generate(
^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\generation\utils.py", line 1575, in generate
result = self._sample(
^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\generation\utils.py", line 2697, in _sample
outputs = self(
^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\.cache\huggingface\modules\transformers_modules\qnguyen3\nanoLLaVA\4a1bd2e2854c6df9c4af831a408b14f7b035f4c0\modeling_llava_qwen2.py", line 2267, in forward
) = self.prepare_inputs_labels_for_multimodal(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\.cache\huggingface\modules\transformers_modules\qnguyen3\nanoLLaVA\4a1bd2e2854c6df9c4af831a408b14f7b035f4c0\modeling_llava_qwen2.py", line 687, in prepare_inputs_labels_for_multimodal
image_features = self.encode_images(images).to(self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\.cache\huggingface\modules\transformers_modules\qnguyen3\nanoLLaVA\4a1bd2e2854c6df9c4af831a408b14f7b035f4c0\modeling_llava_qwen2.py", line 661, in encode_images
image_features = self.get_model().mm_projector(image_features)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\container.py", line 217, in forward
input = module(input)
^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Moo\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\linear.py", line 116, in forward
return F.linear(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
Now, if you've seen this before and read through the error message, you might be able to tell me immediately: "oh, obviously it's running on a different thread, and set_default_device() doesn't carry over to that thread". The issue here is related to this one; I'm not sure which version the fix applies to. But either way, since the default is "cpu", if everything is on cpu that should be fine, right?
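You can demonstrate the thread-local behavior in isolation; a minimal sketch, assuming a CUDA build of PyTorch:

import threading
import torch

torch.set_default_device('cuda')
print(torch.empty(1).device)  # cuda:0 -- the default holds on the main thread

def worker():
    # set_default_device() is thread-local state, so a fresh tensor
    # created on this thread falls back to the CPU
    print(torch.empty(1).device)  # cpu

t = threading.Thread(target=worker)
t.start()
t.join()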
So I modified the AutoModelForCausalLM call to use device_map='cpu', and printed the .device of both the inputs and the model before executing model.generate (the exact prints are sketched after the output below):
<|im_start|>system
Answer the questions.<|im_end|><|im_start|>user
<image>
What do you see?<|im_end|><|im_start|>assistant
cpu
cpu
cpu
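The check itself was roughly the following; the variable names follow the example code, so this is an approximation of my script:

print(text)                 # the rendered chat prompt
print(model.device)         # cpu
print(input_ids.device)     # cpu
print(image_tensor.device)  # cpu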
So what you need to do is figure out a way to execute set_default_device on that thread, or, as seen here, you can use set_default_tensor_type instead, which works:
import torch

# set device
torch.set_default_device('cuda')  # or 'cpu'
# unlike set_default_device, this one also carries over to Gradio's worker thread
torch.set_default_tensor_type('torch.cuda.FloatTensor')
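The difference appears to be that set_default_device() pushes thread-local state, while set_default_tensor_type() changes a process-wide default, so it is still in effect on the anyio worker thread that Gradio uses (note that set_default_tensor_type is deprecated in recent PyTorch releases, though it still works). Alternatively, you can make the thread-local version run on the right thread by calling it at the top of the handler itself; a minimal sketch:

import torch

def answer_question(image, question):
    # Gradio invokes this handler on a worker thread, so the thread-local
    # default set here is visible to everything the handler calls
    torch.set_default_device('cuda')
    # ... then build input_ids / image_tensor and call model.generate as before ...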