I'm building a simple agent with LangChain that uses a locally hosted Hugging Face model (gpt-oss-20b). I create a transformers pipeline and wrap it in LangChain's HuggingFacePipeline.
The model loads onto the GPU correctly with device_map="auto", but when the AgentExecutor is invoked, it fails with a RuntimeError about tensor device placement.
The core of the error is:
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0
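The traceback (further down) bottoms out in F.embedding: the embedding weights live on cuda:0 while the input_ids tensor is still on the CPU. As a purely illustrative aside, this tiny LangChain-free snippet reproduces the same error:

import torch

emb = torch.nn.Embedding(10, 4).to("cuda")  # weights on the GPU
ids = torch.tensor([1, 2, 3])               # index tensor created on the CPU
emb(ids)  # RuntimeError: Expected all tensors to be on the same device ...
# emb(ids.to("cuda")) succeeds, because indices and weights then share a device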
Here is the complete, reproducible script:
import os
from langchain_community.tools import DuckDuckGoSearchRun
from langchain.agents import AgentExecutor, create_react_agent
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline
import torch
# 1. Set up the Language Model
# This points to a local directory containing the model files.
model_path = "../gpt-oss-20b-local"
try:
    # Create a transformers pipeline for text generation
    local_pipe = pipeline(
        "text-generation",
        model=model_path,
        dtype="auto",
        device_map="auto",  # should handle placing the model on the GPU
        max_new_tokens=256,
    )
    # Wrap the pipeline for LangChain
    llm = HuggingFacePipeline(
        pipeline=local_pipe,
        model_kwargs={"temperature": 0.5},
    )
    print("Local LLM Loaded successfully.")
except Exception as e:
    print(f"Error loading local model: {e}")
    exit()
# 2. Define Tools
search = DuckDuckGoSearchRun()
tools = [search]
print("Tools defined.")
# 3. Create Prompt
template = """
Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}
"""
prompt = PromptTemplate.from_template(template)
# 4. Create Agent and Executor
agent = create_react_agent(llm, tools, prompt)
print("Agent created.")
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
print("Agent Executor created. Ready to receive input.")
print("=" * 50)
# 5. Run the Agent
question = "Who is the current prime minister of the United Kingdom and what is their political party?"
response = agent_executor.invoke({"input": question}) # Error occurs here
print("-" * 50)
print(f"Final Response: {response['output']}")
When I run the script, I get this error:
Fetching 40 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 3852.05it/s]
Fetching 40 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 2968.16it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 1.95it/s]
Device set to use cuda:0
/meysam/test-oss/agent.py:28: LangChainDeprecationWarning: The class `HuggingFacePipeline` was deprecated in LangChain 0.0.37 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-huggingface package and should be used instead. To use it run `pip install -U :class:`~langchain-huggingface` and import as `from :class:`~langchain_huggingface import HuggingFacePipeline``.
llm = HuggingFacePipeline(
Local LLM Loaded successfully.
Tools defined.
Agent created.
Agent Executor created. Ready to receive input.
==================================================
> Entering new AgentExecutor chain...
/meysam/envs/new_env/lib/python3.12/site-packages/transformers/generation/utils.py:2412: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
warnings.warn(
Exception in thread Thread-3 (generate):
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/generation/utils.py", line 2539, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/generation/utils.py", line 2867, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/utils/generic.py", line 940, in wrapper
output = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 663, in forward
outputs: MoeModelOutputWithPast = self.model(
^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/utils/generic.py", line 1064, in wrapper
outputs = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 474, in forward
inputs_embeds = self.embed_tokens(input_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/modules/sparse.py", line 192, in forward
return F.embedding(
^^^^^^^^^^^^
File "/meysam/envs/new_env/lib/python3.12/site-packages/torch/nn/functional.py", line 2546, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)
How can I fix it?
The following change fixed it for me: passing the device explicitly in model_kwargs.

llm = HuggingFacePipeline(
    pipeline=local_pipe,
    model_kwargs={"temperature": 0.5, "device": 0},
)

Here device=0 refers to cuda:0, the GPU the model was loaded onto, so the input tensors end up on the same device as the weights.
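As a side note, the deprecation warning in the log points to the maintained langchain-huggingface package. A minimal sketch of the same wrapper on top of that package (assuming the local_pipe built above; the class takes the same pipeline argument):

# pip install -U langchain-huggingface
from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=local_pipe)

The rest of the agent code stays the same.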