How to run Ollama on Brev
Let's just start with the important stuff
Here's how to run it. I'll explain after.
If you want the magic way of running Ollama on Brev, run:
brev ollama -m <insert model name>
This automatically sets up a new instance with Ollama and provides an endpoint for you to interact with.
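For example, to spin up an instance preloaded with Llama3 (a quick sketch; the exact output and endpoint details will depend on your account):

```bash
# Launch a Brev GPU instance with Ollama and Llama3 in one step
brev ollama -m llama3
```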
Or if you want to do it manually / use OpenWebUI, follow the steps below:
brev shell <instance-name> --host
(the --host flag lets you install Docker and gives you sudo permissions)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull <insert model name>
- phi3
- llama3
- dolphin-mixtral
- gemma
- mixtral
- llava
- nous-hermes2
Then, to download OpenWebUI onto your instance:
sudo docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
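If you go the manual route, it can be worth sanity-checking that everything came up before moving on. These are standard Ollama and Docker commands (a sketch, nothing Brev-specific):

```bash
# Confirm the model finished downloading and Ollama is responding
ollama list

# Confirm the OpenWebUI container started
sudo docker ps
```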
Unlocking the Power of LLMs with Ollama and Brev.dev
LLMs like Llama3 and Mistral are revolutionizing tech and inspiring brand-new applications for developers, researchers, and businesses. However, their high computational requirements, and the technical expertise needed to run them, often put them out of reach for individual developers and smaller teams.
Enter Ollama, an open-source tool that democratizes the use of LLMs by enabling users to run them locally on their own machines. Ollama simplifies the complex process of setting up LLMs by bundling model weights, configurations, and datasets into a unified "Modelfile." This approach not only democratizes these models but also optimizes their performance, especially in CPU environments. One of the key advantages of Ollama is its ability to also run efficiently on GPU-accelerated cloud infrastructure. By leveraging the power of GPUs, Ollama can process and generate text at lightning-fast speeds, making it an ideal choice for applications that require real-time or high-throughput language processing.
Enter Brev.dev. As a platform designed to make GPU deployment easy, Brev lets users quickly provision a GPU and set up a Linux VM. This setup is ideal for running sophisticated models via Ollama, providing a seamless experience from model selection to execution.
Together, Ollama and Brev.dev offer a powerful combination for anyone looking to use LLMs without the traditional complexities of setup and optimization. Running Ollama on GPUs provides the flexibility and control needed to push the boundaries of what's possible with AI. Let's dive into how to deploy Ollama on Brev.dev below. This guide was inspired by Matt Williams, co-founder of Ollama, who made a great video about deploying Ollama on Brev here!
Getting Started
To get started running Ollama on a Brev GPU, first create an account and grab some credits. We recommend creating a T4 instance, which you can grab for <$1/hr. Name the instance ollama and wait for it to deploy!
Once it loads, head over to the Access tab to find details on how to SSH into the machine using the CLI. If you're a first-time Brev user, you'll need to install the Brev CLI. You can find the command under Install the CLI; copy and paste it into your terminal, then follow the prompted steps to authenticate!
After you're set up with the Brev CLI, run brev refresh to refresh your instance list, then run brev ls to view all of your running instances. You should see your ollama instance in the list. It's now time to SSH into it!
The CLI command on the console shows how you can SSH into the Verb container we install onto GPUs to handle Python / CUDA dependencies. To ensure we don't run into any issues, we're going to SSH into the host VM itself. Run this command to SSH in: brev shell ollama --host
*It's very important you SSH into the VM using the --host flag at the end of your command!
You should now see yourself as an Ubuntu host user. It's now time to install Ollama! Run this script to install it: curl -fsSL https://ollama.com/install.sh | sh
Now that Ollama is installed, let's download a model. You can run ollama pull <model name> to download any model; the full list of models can be found here. At the time of writing, Llama3 is the most popular open-source model, so that's the one we'll play with. Run ollama pull llama3.
Once your model is downloaded, you can query it with ollama run <model name>. Since we are working with Llama3, I will run ollama run llama3. I'm going to ask it a simple question.
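If you'd rather query the model programmatically than through the interactive prompt, Ollama also serves a local REST API on port 11434 while it's running. A minimal sketch (the prompt is just an example):

```bash
# Ask the local llama3 model a question via Ollama's REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```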
You're now set up running Ollama on Brev.dev! You can stop your instance at any time when you're not using it and restart it whenever you need access to your Ollama model. If you're interested in setting up OpenWebUI for Ollama (think ChatGPT-style interface), you can find instructions for setting it up below:
Run this command to download OpenWebUI:
sudo docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
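The image can take a minute or two to pull on the first run. If the link you expose below doesn't respond right away, you can follow the container's startup logs (standard Docker, shown as a sketch):

```bash
# Follow OpenWebUI's startup logs until it reports it is ready
sudo docker logs -f open-webui
```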
Go back to console.brev.dev and open your instance details page. We'll need to expose a port on the server to grab a link for accessing OpenWebUI. The command you just ran sets up OpenWebUI to run on port 3000 on the GPU. Go to the "Access" tab of your instance and scroll down to the Using Tunnels section. If you want to share a link with others, click the blue "Share a Service" button and type 3000 in the Port Number field. If you're fine with just getting a link to mess around with in your browser, expose port 3000 in the Using Ports section. You can now click on the link to launch OpenWebUI!
You'll be prompted to log in / create an account for OpenWebUI. To my understanding, this stays local to your machine and doesn't get shared with anyone; it's mainly useful if you're running a service via OpenWebUI. Otherwise, just type in a random login like I did!
OpenWebUI is super cool! You can pull in any models from this library or even upload your own Modelfiles. I decided to pull in Llama3 directly from the UI and ask it to make a haiku about running Ollama on Brev 🤣
What's really interesting with OpenWebUI is that you can upload your own Modelfiles super easily through the interface. A Modelfile is essentially just a configuration file that defines and manages a model on the Ollama platform. You can create new models or modify existing ones with a Modelfile to handle specific application scenarios: bake a custom prompt into the model, adjust the context length, temperature, and random seed, reduce nonsensical outputs, and vary the diversity of the text output. (Note: this adjusts how the model is prompted and sampled; it does not fine-tune the underlying weights.) Modelfiles let you talk to diverse characters and assistants, making your chat interactions unique and tailored to a use case.
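To make this concrete, here's a minimal sketch of creating a custom model from a Modelfile on the instance; the base model, parameters, and system prompt are just illustrative examples:

```bash
# Write a simple Modelfile (base model, sampling parameters, and prompt are examples)
cat > Modelfile <<'EOF'
FROM llama3
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a friendly assistant that answers questions about running models on GPUs."""
EOF

# Register the custom model with Ollama, then chat with it
ollama create my-assistant -f Modelfile
ollama run my-assistant
```

The same Modelfile could also be uploaded through the OpenWebUI interface instead of using the CLI.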
You can even perform RAG through the UI! Upload documents, PDFs, etc., and an embedding model of your choice will chunk and embed them so basic RAG can be performed over them.
When you're done using OpenWebUI/Ollama, feel free to stop the instance to save yourself from unnecessary charges. Our UI shows the cost of keeping a stopped instance around (usually a couple of cents per hour), and it's super easy to pick up where you left off by restarting it whenever you need.