python, nlp, huggingface-transformers, langchain

RuntimeError: Expected all tensors to be on the same device when using local HuggingFace model in LangChain Agent


I'm building a simple agent with LangChain that uses a locally hosted HuggingFace model (gpt-oss-20b). I create a transformers pipeline and wrap it in LangChain's HuggingFacePipeline.

The model loads correctly onto the GPU using device_map="auto", but when the AgentExecutor is invoked, it fails with a RuntimeError related to tensor device placement.

The core of the error is:

RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0
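
The traceback (included in full below) fails inside embed_tokens: the embedding weights sit on cuda:0 while the input_ids tensor arrives on the CPU. To see where the pipeline puts its inputs versus where the weights live, the pipeline object can be inspected; this is a quick diagnostic sketch using standard transformers Pipeline/PreTrainedModel attributes, where local_pipe is the pipeline created in the script below:

# Diagnostic: compare the pipeline's input device with the model's weight placement.
print(local_pipe.device)               # device the pipeline uses for its input tensors
print(local_pipe.model.device)         # device of the model's parameters
print(local_pipe.model.hf_device_map)  # per-module placement created by device_map="auto"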

Here is the complete, reproducible script:

import os
from langchain_community.tools import DuckDuckGoSearchRun
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline
import torch

# 1. Set up the Language Model
# This points to a local directory containing the model files.
model_path = "../gpt-oss-20b-local" 

try:
    # Create a transformers pipeline for text generation
    local_pipe = pipeline(
        "text-generation",
        model=model_path,
        dtype="auto",
        device_map="auto", # Should handle placing the model on GPU
        max_new_tokens=256,
    )

    # Wrap the pipeline for LangChain
    llm = HuggingFacePipeline(
        pipeline=local_pipe,
        model_kwargs={"temperature": 0.5},
    )
    print("Local LLM Loaded successfully.")

except Exception as e:
    print(f"Error loading local model: {e}")
    exit()

# 2. Define Tools
search = DuckDuckGoSearchRun()
tools = [search]

# 3. Create Prompt
template = """
Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}
"""
prompt = PromptTemplate.from_template(template)


# 4. Create Agent and Executor
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

print("Agent Executor created. Ready to receive input.")
print("=" * 50)

# 5. Run the Agent
question = "Who is the current prime minister of the United Kingdom and what is their political party?"
response = agent_executor.invoke({"input": question}) # Error occurs here

print("-" * 50)
print(f"Final Response: {response['output']}")

When I run the script, I get this error:

Fetching 40 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 3852.05it/s]
Fetching 40 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 2968.16it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.95it/s]
Device set to use cuda:0
/meysam/test-oss/agent.py:28: LangChainDeprecationWarning: The class `HuggingFacePipeline` was deprecated in LangChain 0.0.37 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-huggingface package and should be used instead. To use it run `pip install -U :class:`~langchain-huggingface` and import as `from :class:`~langchain_huggingface import HuggingFacePipeline``.
  llm = HuggingFacePipeline(
Local LLM Loaded successfully.
Tools defined.
Agent created.
Agent Executor created. Ready to receive input.
==================================================


> Entering new AgentExecutor chain...
/meysam/envs/new_env/lib/python3.12/site-packages/transformers/generation/utils.py:2412: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
Exception in thread Thread-3 (generate):
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/generation/utils.py", line 2539, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/generation/utils.py", line 2867, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/utils/generic.py", line 940, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 663, in forward
    outputs: MoeModelOutputWithPast = self.model(
                                      ^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/utils/generic.py", line 1064, in wrapper
    outputs = func(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 474, in forward
    inputs_embeds = self.embed_tokens(input_ids)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/sparse.py", line 192, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/functional.py", line 2546, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)

How can I fix it?


Solution

  • I used this code and it worked:

    llm = HuggingFacePipeline(
        pipeline=local_pipe,
        model_kwargs={"temperature": 0.5, "device": 0},
    )

    Setting device to 0 places the pipeline on cuda:0, so the input tensors are created on the same device as the model weights.
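
    For completeness, here is a sketch of an alternative that also avoids the device mismatch. It is not from the original answer; it follows the deprecation warning shown in the output above by switching to the maintained langchain-huggingface package, and it pins the transformers pipeline to a single GPU with device=0 (transformers rejects passing both device and device_map, so device_map="auto" is dropped). This assumes the whole model fits on one GPU:

    from langchain_huggingface import HuggingFacePipeline
    from transformers import pipeline

    # Pin the pipeline to GPU 0: inputs are then created on cuda:0,
    # the same device as the model weights.
    local_pipe = pipeline(
        "text-generation",
        model="../gpt-oss-20b-local",
        dtype="auto",
        device=0,            # instead of device_map="auto"; don't pass both
        max_new_tokens=256,
    )

    llm = HuggingFacePipeline(pipeline=local_pipe)

    The tradeoff is that device=0 puts the entire model on a single GPU, whereas device_map="auto" can shard it across several devices; if you need multi-GPU sharding, keep device_map="auto" and make sure the inputs are moved to the model's device before generation.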