How to run Ollama on Brev
Let's just start with the important stuff
Here's how to run it. I'll explain after.
If you want the magic way of running Ollama on Brev, run:
brev ollama -m <insert model name>
This automatically sets up a new instance with Ollama and provides an endpoint for you to interact with.
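For example, to spin up an instance preloaded with Llama3 (a quick sketch; the exact output and endpoint details will depend on your account):

```bash
# Launch a Brev GPU instance with Ollama and Llama3 in one step
brev ollama -m llama3
```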
Or if you want to do it manually / use OpenWebUI, follow the steps below:
brev shell <instance-name> --host
(the --host flag lets you install Docker and gives you sudo permissions)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull <insert model name>
- phi3
- llama3
- dolphin-mixtral
- gemma
- mixtral
- llava
- nous-hermes2
Then, to download OpenWebUI onto your instance:
sudo docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
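If you go the manual route, it can be worth sanity-checking that everything came up before moving on. These are standard Ollama and Docker commands (a sketch, nothing Brev-specific):

```bash
# Confirm the model finished downloading and Ollama is responding
ollama list

# Confirm the OpenWebUI container started
sudo docker ps
```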
Unlocking the Power of LLMs with Ollama and Brev.dev
LLMs like Llama3 and Mistral are revolutionizing tech and inspiring brand-new applications for developers, researchers, and businesses. However, their high computational requirements, and the technical expertise needed to run them, often put them out of reach for individual developers and smaller teams.
Enter Ollama, an open-source tool that democratizes the use of LLMs by enabling users to run them locally on their own machines. Ollama simplifies the complex process of setting up LLMs by bundling model weights, configurations, and datasets into a unified "Modelfile." This approach not only democratizes these models but also optimizes their performance, especially in CPU environments. One of the key advantages of Ollama is its ability to also run efficiently on GPU-accelerated cloud infrastructure. By leveraging the power of GPUs, Ollama can process and generate text at lightning-fast speeds, making it an ideal choice for applications that require real-time or high-throughput language processing.
Enter Brev.dev. As a platform designed to make GPU deployment easy, Brev lets users quickly provision a GPU and set up a Linux VM. This setup is ideal for running sophisticated models via Ollama, providing a seamless experience from model selection to execution.
Together, Ollama and Brev.dev offer a powerful combination for anyone looking to use LLMs without the traditional complexities of setup and optimization. Running Ollama on GPUs provides the flexibility and control needed to push the boundaries of what's possible with AI. Let's dive into how to deploy Ollama on Brev.dev below. This guide was inspired by Matt Williams, co-founder of Ollama, who made a great video about deploying Ollama on Brev here!
Getting Started
To get started running Ollama on a Brev GPU, first create an account and grab some credits. We recommend creating a T4 instance, which you can grab for <$1/hr. Name the instance ollama and wait for it to deploy!
Once it loads, head over to the Access tab to find details on how to SSH into the machine using the CLI. If you're a first-time Brev user, you'll need to install the Brev CLI. You can find the command under Install the CLI; copy and paste it into your terminal, then follow the prompted steps to authenticate!
After you're set up with the Brev CLI, run brev refresh to refresh your instance list, then run brev ls to view all of your running instances. You should see your ollama instance in the list. It's now time to SSH into it!
The CLI command on the console shows how you can SSH into the Verb container we install onto GPUs to handle Python / CUDA dependencies. To ensure we don't run into any issues, we're going to SSH into the host VM itself. Run this command to SSH in: brev shell ollama --host
*It's very important you SSH into the VM using the --host flag at the end of your command!
You should now see yourself as an Ubuntu host user. It's now time to install Ollama! Run this script to install it: curl -fsSL https://ollama.com/install.sh | sh
Now that Ollama is installed, let's download a model. You can run ollama pull <model name> to download any model; the full list of models can be found here. At the time of writing, Llama3 is the most popular open-source model, so that's the one we'll play with. Run ollama pull llama3.
Once your model is downloaded, you can query it with ollama run <model name>. Since we are working with Llama3, I will run ollama run llama3. I'm going to ask it a simple question.
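If you'd rather query the model programmatically than through the interactive prompt, Ollama also serves a local REST API on port 11434 while it's running. A minimal sketch (the prompt is just an example):

```bash
# Ask the local llama3 model a question via Ollama's REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```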
You're now set up running Ollama on Brev.dev! You can stop your instance at any time when you're not using it and restart it whenever you need access to your Ollama model. If you're interested in setting up OpenWebUI for Ollama (think ChatGPT-style interface), you can find instructions for setting it up below:
Run this command to download OpenWebUI:
sudo docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
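The image can take a minute or two to pull on the first run. If the link you expose below doesn't respond right away, you can follow the container's startup logs (standard Docker, shown as a sketch):

```bash
# Follow OpenWebUI's startup logs until it reports it is ready
sudo docker logs -f open-webui
```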
Go back to console.brev.dev and open your instance details page. We'll need to expose a port on the server to grab a link for accessing OpenWebUI. The command you just ran sets up OpenWebUI to run on port 3000 on the GPU. Go to the "Access" tab of your instance and scroll down to the Using Tunnels section. If you want to share a link with others, click the blue "Share a Service" button and type 3000 in the Port Number field. If you're fine with just getting a link to mess around with in your browser, expose port 3000 in the Using Ports section. You can now click on the link to launch OpenWebUI!
You'll be prompted to log in / create an account for OpenWebUI. To my understanding, this stays local to your machine and doesn't get shared with anyone; it's mainly useful if you're running a service via OpenWebUI. Otherwise, just type in a random login like I did!
OpenWebUI is super cool! You can pull in any models from this library or even upload your own Modelfiles. I decided to pull in Llama3 directly from the UI and ask it to make a haiku about running Ollama on Brev 🤣
What's really interesting with OpenWebUI is that you can upload your own Modelfiles super easily through the interface. A Modelfile is essentially just a configuration file that defines and manages a model on the Ollama platform. You can create new models or modify existing ones with a Modelfile to handle specific application scenarios: bake a custom prompt into the model, adjust the context length, temperature, and random seed, reduce nonsensical outputs, and vary the diversity of the text output. (Note: this adjusts how the model is prompted and sampled; it does not fine-tune the underlying weights.) Modelfiles let you talk to diverse characters and assistants, making your chat interactions unique and tailored to a use case.
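To make this concrete, here's a minimal sketch of creating a custom model from a Modelfile on the instance; the base model, parameters, and system prompt are just illustrative examples:

```bash
# Write a simple Modelfile (base model, sampling parameters, and prompt are examples)
cat > Modelfile <<'EOF'
FROM llama3
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a friendly assistant that answers questions about running models on GPUs."""
EOF

# Register the custom model with Ollama, then chat with it
ollama create my-assistant -f Modelfile
ollama run my-assistant
```

The same Modelfile could also be uploaded through the OpenWebUI interface instead of using the CLI.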
You can even perform RAG through the UI! Upload documents, PDFs, etc., and an embedding model of your choice will chunk and embed them so basic RAG can be performed over them.
When you're done using OpenWebUI/Ollama, feel free to stop the instance to save yourself from unnecessary charges. Our UI shows the cost of keeping a stopped instance around (usually a couple of cents per hour), and it's super easy to pick up where you left off by restarting it whenever you need.