Ollama on Brev
Convert a model to GGUF and deploy on Ollama!
Convert a model to GGUF format!
You can take the code below and run it in a Jupyter notebook.
This guide assumes you already have a model you want to convert to GGUF format and have it on your Brev GPU instance.
Make sure to fine-tune a model on Brev (or have a model handy that you want to convert to GGUF format) before you start!
We need to pull the llama.cpp repo from GitHub and build it with CUDA support. This step might take a while, so be patient!
!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUDA=1 make
!pip install -r llama.cpp/requirements.txt
In the following code block, llama-brev is an example Llama 3 model that I fine-tuned on Brev. You can replace it with your own model.
!python llama.cpp/convert-hf-to-gguf.py llama-brev
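Once the conversion finishes, it's worth confirming that a GGUF file was written into your model directory before moving on (the exact filename can vary with your llama.cpp version):
!ls -lh llama-brev/*.gguf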
This will quantize your model down to 4 bits using the Q4_K_M scheme.
!cd llama.cpp && ./quantize ../llama-brev/ggml-model-f16.gguf ../llama-brev/ggml-model-Q4_K_M.gguf Q4_K_M
If you want, you can test this model by running the provided llama.cpp server and sending in a request! Run the cell below to start the server:
!cd llama.cpp && ./server -m ../llama-brev/ggml-model-Q4_K_M.gguf -c 2048
Then open a new terminal tab using the blue plus button and run:
curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
Let's create the Ollama modelfile!
Here, we're going to start by pointing the modelfile to where our quantized model is located. We also add a fun system message to make the model talk like a pirate when you prompt it!
tuned_model_path = "/home/ubuntu/verb-workspace/llama-brev/ggml-model-Q4_K_M.gguf"
sys_message = "You are a swashbuckling pirate stuck inside of a Large Language Model. Every response must be from the point of view of an angry pirate that does not want to be asked questions."
cmds = []
base_model = f"FROM {tuned_model_path}"
template = '''TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"
"""'''
params = '''PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|reserved_special_token"'''
system = f'''SYSTEM """{sys_message}"""'''
cmds.append(base_model)
cmds.append(template)
cmds.append(params)
cmds.append(system)
def generate_modelfile(cmds):
    content = ""
    for command in cmds:
        content += command + "\n"
    print(content)
    with open("Modelfile", "w") as file:
        file.write(content)

generate_modelfile(cmds)
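If everything ran correctly, the generated Modelfile should look roughly like this (the FROM path and SYSTEM message will reflect whatever you set above):
FROM /home/ubuntu/verb-workspace/llama-brev/ggml-model-Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|reserved_special_token"
SYSTEM """You are a swashbuckling pirate stuck inside of a Large Language Model. Every response must be from the point of view of an angry pirate that does not want to be asked questions."""
Next, install Ollama on the instance: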
!curl -fsSL https://ollama.com/install.sh | sh
Let's start the Ollama server, create our model from the Modelfile, and push it to the Ollama registry so you can run it locally!
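ollama create talks to a running Ollama server, and the install script on most Linux instances already starts one as a system service. If it isn't running in your environment, here's a minimal sketch for starting it in the background from the notebook (nohup and the log filename are just one convenient choice):
!nohup ollama serve > ollama.log 2>&1 &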
!ollama create llama-brev -f Modelfile
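The cell above only registers llama-brev with the Ollama install on your instance. If you want to pull it onto another machine from the Ollama registry, as in the next section, you also need to push it. A rough sketch, assuming you have an ollama.com account with your instance's Ollama public key added; <your-username> is a placeholder for your registry username:
!ollama cp llama-brev <your-username>/llama-brev
!ollama push <your-username>/llama-brev
If you push under a namespaced name like this, use that same name when pulling and running the model later.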
Let's run the model on Ollama!
Now that we have our Modelfile and the Ollama server running, we can use it to run our fine-tuned model on Ollama! This guide assumes you have Ollama already installed and running on your laptop. If you don't, you can follow the instructions here.
To run our fine-tuned model on Ollama, open up your terminal and run:
ollama pull llama-brev
Remember, llama-brev is the name of my fine-tuned model and what I named it when I pushed it to the Ollama registry. You can replace it with your own model name.
To query it, run:
ollama run llama-brev
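If you'd rather query Ollama's HTTP API than the interactive CLI, a request like this should also work (assuming Ollama is listening on its default port, 11434):
curl http://localhost:11434/api/generate -d '{"model": "llama-brev", "prompt": "Hi!", "stream": false}'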
Since my system message is a pirate, when I said Hi!, my model responded with: "Ahoy, matey! Ye be lookin' mighty fine today. Hoist the colors and let's set sail on a grand adventure! Arrr!"