python pyngrok ollama

How to run Ollama in Google Colab?


I have the code below. When I run it, I get an ngrok link.

!pip install aiohttp pyngrok

import os
import asyncio
from aiohttp import ClientSession

# Set LD_LIBRARY_PATH so the system NVIDIA library becomes preferred
# over the built-in library. This is particularly important for
# Google Colab which installs older drivers
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})

async def run(cmd):
  '''
  run is a helper function to run subcommands asynchronously.
  '''
  print('>>> starting', *cmd)
  p = await asyncio.subprocess.create_subprocess_exec(
      *cmd,
      stdout=asyncio.subprocess.PIPE,
      stderr=asyncio.subprocess.PIPE,
  )

  async def pipe(lines):
    async for line in lines:
      print(line.strip().decode('utf-8'))

  await asyncio.gather(
      pipe(p.stdout),
      pipe(p.stderr),
  )


await asyncio.gather(
    run(['ollama', 'serve']),
    run(['ngrok', 'http', '--log', 'stderr', '11434']),
)

When I open that link, this is what the page shows:

[screenshot of the page]

How can I fix this? Before running the code above, I did the following:

!choco install ngrok
!ngrok config add-authtoken -----
!curl https://ollama.ai/install.sh | sh
!command -v systemctl >/dev/null && sudo systemctl stop ollama

Solution

    1. Run ollama but don't stop it

    !curl https://ollama.ai/install.sh | sh
    
    # should produce, among other things:
    # The Ollama API is now available at 0.0.0.0:11434
    

    This means Ollama is running (but do check for errors, especially around graphics capability/CUDA, as these may interfere).

    However, don't run !command -v systemctl >/dev/null && sudo systemctl stop ollama (unless you want to stop Ollama).

    The next step is to start the Ollama service. Since you are using ngrok, I'm assuming you want to run the LLM from environments outside the Colab. If that isn't the case, you don't really need ngrok. Colabs are tricky to get working nicely with async code and threads, though, so a useful setup is to let the Colab provide a VM powerful enough for models larger than anything you could run in your own dev environment (if that is an issue) and to talk to it remotely.

    2. Set up ngrok and forward the local ollama service to a public URI

    Ollama isn't yet running as a service but we can set up ngrok in advance of this:

    import os
    import asyncio
    import queue
    import threading
    import time
    from pyngrok import ngrok
    
    # Get your ngrok token from your ngrok account:
    # https://dashboard.ngrok.com/get-started/your-authtoken
    token="your token goes here - don't forget to replace this with it!"
    ngrok.set_auth_token(token)
    
    # Set up a stoppable thread (not mandatory, but cleaner if you want to stop this later)
    class StoppableThread(threading.Thread):
        def __init__(self, *args, **kwargs):
            super(StoppableThread, self).__init__(*args, **kwargs)
            self._stop_event = threading.Event()
    
        def stop(self):
            self._stop_event.set()
    
        def is_stopped(self):
            return self._stop_event.is_set()
    
    def start_ngrok(q, stop_event):
        try:
            # Start an HTTP tunnel on the specified port
            public_url = ngrok.connect(11434)
            # Put the public URL in the queue
            q.put(public_url)
            # Keep the thread alive until stop event is set
            while not stop_event.is_set():
                time.sleep(1)  # Adjust sleep time as needed
        except Exception as e:
            print(f"Error in start_ngrok: {e}")
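
The worker-plus-queue pattern that start_ngrok relies on can be tried in isolation first. This is only a minimal sketch with no ngrok involved: `worker`, `result_queue` and `stop_event` are hypothetical names, and the string placed on the queue stands in for the tunnel URL.

```python
import queue
import threading

def worker(q, stop_event):
    # Hand one result back to the caller, like start_ngrok does with the URL
    q.put("result from worker")
    # Event.wait() returns as soon as the event is set, so no sleep loop is needed
    stop_event.wait()

result_queue = queue.Queue()
stop_event = threading.Event()
t = threading.Thread(target=worker, args=(result_queue, stop_event), daemon=True)
t.start()

# A timeout stops the caller from blocking forever if the worker failed early
value = result_queue.get(timeout=5)
print(value)

# Ask the worker to finish, then wait for it to exit
stop_event.set()
t.join(timeout=5)
print("worker alive:", t.is_alive())
```

The key point is that the worker receives a real threading.Event instance, so the same object can be set from the main thread when you want the worker to stop.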
    

    Run that code so the functions exist. Then, in the next cell, start ngrok in a separate thread so it doesn't hang your Colab. We use a queue to share data between threads, because we want to know what the ngrok public URL is once the tunnel is up:

    # Create a queue to share data between threads
    url_queue = queue.Queue()
    
    # Event used to tell the ngrok thread to stop later (call stop_event.set())
    stop_event = threading.Event()
    
    # Start ngrok in a separate thread
    ngrok_thread = StoppableThread(target=start_ngrok, args=(url_queue, stop_event))
    ngrok_thread.start()
    

    That will be running, but you need to get the results from the queue to see what ngrok returned, so then do:

    # Wait for the ngrok tunnel to be established
    while True:
        try:
            # Block for up to 1 second at a time waiting for the URL
            public_url = url_queue.get(timeout=1)
            break
        except queue.Empty:
            print("Waiting for ngrok URL...")
    
    print("Ngrok tunnel established at:", public_url)
    

    This should output something like:

    Ngrok tunnel established at: NgrokTunnel: "https://{somelongsubdomain}.ngrok-free.app" -> "http://localhost:11434"
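
What gets printed there is a tunnel object rather than a bare string. As far as I know, pyngrok exposes the address on the tunnel's `public_url` attribute, which is the value you need later for OLLAMA_HOST. A small sketch, where `tunnel_url` is a hypothetical helper and `fake_tunnel` is a stand-in so the snippet runs without ngrok:

```python
from types import SimpleNamespace

def tunnel_url(tunnel):
    """Return the bare URL string from a tunnel object (hypothetical helper)."""
    # Assumes the tunnel keeps its address in .public_url; falls back to
    # str() so the helper also accepts a plain string.
    return getattr(tunnel, "public_url", str(tunnel))

# Stand-in object; in the Colab you would pass the value taken from url_queue
fake_tunnel = SimpleNamespace(public_url="https://example.ngrok-free.app")
print(tunnel_url(fake_tunnel))
```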
    

    3. Run ollama as an async process

    import os
    import asyncio
    
    # NB: Depending on which backend you are running, you may need to set
    # these environment variables to get CUDA working.
    os.environ['PATH'] += ':/usr/local/cuda/bin'
    # Set LD_LIBRARY_PATH to include both /usr/lib64-nvidia and CUDA lib directories
    os.environ['LD_LIBRARY_PATH'] = '/usr/lib64-nvidia:/usr/local/cuda/lib64'
    
    async def run_process(cmd):
        print('>>> starting', *cmd)
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE
        )
    
        # define an async pipe function that echoes each line of a stream
        async def pipe(lines):
            async for line in lines:
                print(line.decode().strip())
    
        # stream stdout and stderr concurrently until the process closes them
        await asyncio.gather(pipe(process.stdout), pipe(process.stderr))
    

    That creates the function to run an async command but doesn't run it yet.

    This will start ollama in a separate thread so your Colab isn't blocked:

    import asyncio
    import threading
    
    async def start_ollama_serve():
        await run_process(['ollama', 'serve'])
    
    def run_async_in_thread(loop, coro):
        asyncio.set_event_loop(loop)
        loop.run_until_complete(coro) 
        loop.close()
    
    # Create a new event loop that will run in a new thread 
    new_loop = asyncio.new_event_loop() 
    
    # Start ollama serve in a separate thread so the cell won't block execution 
    thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
    thread.start() 
    

    It should produce something like:

    >>> starting ollama serve
    Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
    Your new public key is:
    
    ssh-ed25519 {some key}
    
    2024/01/16 20:19:11 images.go:808: total blobs: 0
    2024/01/16 20:19:11 images.go:815: total unused blobs removed: 0
    2024/01/16 20:19:11 routes.go:930: Listening on 127.0.0.1:11434 (version 0.1.20)
    

    Now you're all set up. You can do the next steps in the Colab, but it might be easier to run them on your local machine if you normally dev there.
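
Before moving on, it may be worth confirming that the server actually answers HTTP requests. The sketch below uses only the standard library. `ollama_is_up` is a hypothetical helper, and the default address is assumed from the "Listening on 127.0.0.1:11434" log line above.

```python
import http.client
import urllib.error
import urllib.request

def ollama_is_up(base_url="http://127.0.0.1:11434", timeout=3):
    """Return True if an HTTP GET on base_url succeeds (hypothetical helper)."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, or a malformed reply
        return False

print("Ollama reachable:", ollama_is_up())
```

Passing the ngrok URL as `base_url` instead checks the whole path through the tunnel.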

    4. Run an ollama model remotely from your local dev environment

    This assumes you have installed ollama in your local dev environment, i.e. the laptop or desktop machine in front of you (as opposed to the Colab). I'm assuming it's Linux, for example WSL2.

    Replace the actual URI below with whatever public URI ngrok reported above:

    export OLLAMA_HOST=https://{longcode}.ngrok-free.app/
    

    You can now run ollama and it will run on the remote in your Colab (so long as that stays up and running).

    For example, run this on your local machine. It will look as if it's running locally, but it's really running in your Colab, and the results are served to wherever you call it from (so long as OLLAMA_HOST is set correctly and is a valid tunnel to your ollama service):

    ollama run mistral
    

    You can now interact with the model on the command line locally but the model runs on the Colab.
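
If you would rather call the model from a script than from the CLI, the same tunnel can be reached through Ollama's REST API. This is a rough sketch, assuming the `/api/generate` endpoint with a non-streaming request. `build_generate_request` and `generate` are hypothetical helper names, and the example URL is a placeholder for whatever ngrok reported.

```python
import json
import urllib.request

def build_generate_request(base_url, model, prompt):
    """Build a POST request for the generate endpoint (hypothetical helper)."""
    # stream=False asks for one JSON object instead of a stream of chunks
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(base_url, model, prompt, timeout=300):
    """Send the request and return the generated text."""
    req = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Placeholder URL: replace it with your own tunnel address before calling generate()
req = build_generate_request("https://example.ngrok-free.app/", "mistral", "Why is the sky blue?")
print(req.get_method(), req.full_url)
```

With a live tunnel, `generate("https://{longcode}.ngrok-free.app", "mistral", "Why is the sky blue?")` should return the model's answer as a string, provided the model has already been pulled on the Colab side.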

    If you want to run larger models, like mixtral, be sure to connect your Colab to a back-end runtime that's powerful enough (e.g. 48GB+ of RAM, so a V100 GPU is the minimum spec for this at the time of writing).

    Note: If you see any issues with CUDA or NVIDIA in the outputs of any of the steps above, don't proceed until you fix them.
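
One quick way to see whether the Colab runtime has a usable GPU at all is to call nvidia-smi from Python. A minimal sketch, assuming the tool is on the PATH of GPU runtimes; `gpu_summary` is a hypothetical helper name.

```python
import shutil
import subprocess

def gpu_summary():
    """Return nvidia-smi's output, or None if the tool is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run([exe], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None

summary = gpu_summary()
print(summary if summary else "No usable NVIDIA GPU detected")
```

If this reports no GPU, switching the Colab runtime type to a GPU one is probably the first thing to check before debugging Ollama itself.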

    Hope that helps!

    Gruff