Ollama with Sugoi LLM 14B/32B

Download Required Files
- Download the Sugoi 14B or Sugoi 32B model file in GGUF format from:
https://www.patreon.com/posts/sugoi-llm-14b-131493423
- Download and install Ollama from: https://ollama.com/
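On Linux, Ollama can also be installed with its official one-line install script (macOS and Windows use the regular installers from the download page):
curl -fsSL https://ollama.com/install.sh | sh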
Start Ollama
- Start the server manually by running
ollama serve
in a terminal, or check for the tray icon (if it is present, Ollama was auto-started after installation). A quick way to confirm the server is up follows below.
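To confirm the server is reachable, query its root endpoint; it replies with a short plain-text status message:
curl http://localhost:11434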
Create a Modelfile
Create a file named sugoi14b.modelfile with the following content:
# replace with correct <path>
FROM /<path>/sugoi14b.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER top_k 40
PARAMETER top_p 0.9
TEMPLATE """{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""
SYSTEM You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Load the Model
Run the following command:
ollama create sugoi14b --file sugoi14b.modelfile
- Wait for completion
- Confirm with:
ollama list
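The 32B model follows the same procedure. A minimal sketch, assuming the GGUF file is named sugoi32b.gguf (adjust the path and filename to match your download); only the FROM line changes, while the PARAMETER, TEMPLATE, and SYSTEM blocks stay the same:
# replace with correct <path>
FROM /<path>/sugoi32b.gguf
# ... same PARAMETER, TEMPLATE, and SYSTEM lines as above
Then register it with:
ollama create sugoi32b --file sugoi32b.modelfile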
Running the Model
- The model loads automatically when it is called through the API
- Manual run:
ollama run sugoi14b
- The Ollama server listens at:
http://localhost:11434
- Accessible via the OpenAI-compatible API (a request sketch follows this list)
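A minimal request sketch against Ollama's OpenAI-compatible chat endpoint; the model name must match the one used with ollama create, and the example prompt is only illustrative:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sugoi14b",
    "messages": [
      {"role": "user", "content": "Translate to English: 今日はいい天気ですね。"}
    ]
  }'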
Settings (default)
- Context Window Size: 4096 – the maximum number of tokens the model can consider at once; a larger value allows longer input context.
- Temperature: 0.7 – controls randomness; lower values make outputs more focused and deterministic. Some users prefer 0.3.
- Top P: 0.9 – used in nucleus sampling; the model considers the smallest set of top tokens whose probabilities sum to 90%. Works together with Top K. A higher value (e.g., 0.95) leads to more diverse text, while a lower value (e.g., 0.5) generates more focused and conservative text.
- Top K: 40 – limits sampling to the 40 most likely next tokens, which reduces the chance of generating nonsense. A higher value (e.g., 100) gives more diverse answers, while a lower value (e.g., 10) is more conservative.
- Repeat Penalty: 1.1 – penalizes repeated phrases to reduce repetition; 1.0 means no penalty.
- GPU Layers: 49 (14B) / 65 (32B) – the number of transformer layers offloaded to the GPU; these values are the maximums, so the model runs fully GPU-accelerated. Even with fewer layers offloaded, waiting times stay reasonable when streaming output is used.
- Note: Experiment with these options based on your hardware; they can also be overridden per request, as sketched below.
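A sketch of overriding these settings per request through Ollama's native generate endpoint, without editing the Modelfile; the options keys are Ollama's standard parameter names, num_gpu sets how many layers are offloaded to the GPU, and the prompt is only illustrative:
curl http://localhost:11434/api/generate -d '{
  "model": "sugoi14b",
  "prompt": "Translate to English: 頑張ってください。",
  "stream": false,
  "options": {
    "num_ctx": 4096,
    "temperature": 0.3,
    "top_k": 40,
    "top_p": 0.9,
    "repeat_penalty": 1.1,
    "num_gpu": 49
  }
}'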