Ollama with Sugoi LLM 14B/32B

Download Required Files

  1. Download the Sugoi 14B or Sugoi 32B model file in GGUF format:

https://www.patreon.com/posts/sugoi-llm-14b-131493423

  2. Download and install Ollama from: https://ollama.com/

Start Ollama

  • Start ollama serve manually in a terminal, or check for the tray icon (if it is present, the server was auto-started).
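
To confirm the server is reachable, you can query the version endpoint of Ollama's REST API (it returns a small JSON object with the installed version):

curl http://localhost:11434/api/version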

Create a Modelfile

Create a file named sugoi14b.modelfile with the following content:

# replace with correct <path>
FROM /<path>/sugoi14b.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER top_k 40
PARAMETER top_p 0.9

TEMPLATE """{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""
SYSTEM You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

Load the Model

Run the following command:

ollama create sugoi14b --file sugoi14b.modelfile

  • Wait for the import to complete
  • Confirm with: ollama list
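
To double-check what was baked into the model, the standard ollama show command can print the stored Modelfile (including parameters and template):

ollama show sugoi14b --modelfile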

Running the Model

  • The model is loaded automatically when called through the API
  • Manual run: ollama run sugoi14b
  • The server listens at: http://localhost:11434
  • Accessible via OpenAI-compatible APIs (example request below)
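
As a sketch of an API call, here is a request to Ollama's OpenAI-compatible chat endpoint (/v1/chat/completions); the prompt is only illustrative:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "sugoi14b",
        "messages": [
          {"role": "user", "content": "Translate to English: 今日はいい天気ですね。"}
        ]
      }'

Any OpenAI client library can be pointed at the same server by setting its base URL to http://localhost:11434/v1 (the API key can be any placeholder string).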

Default Settings

  • Context Window Size: 4096 – The maximum number of tokens the model can consider at once. A larger size allows for longer input context.
  • Temperature: 0.7 – Controls randomness; lower values make outputs more focused and deterministic. Some users prefer 0.3.
  • Top P: 0.9 – Used in nucleus sampling; the model considers the smallest set of top tokens whose probabilities sum to 90%. Works together with top-k. A higher value (e.g., 0.95) leads to more diverse text, while a lower value (e.g., 0.5) generates more focused and conservative text.
  • Top K: 40 – Limits sampling to the 40 most likely next tokens, which reduces the probability of generating nonsense. A higher value (e.g., 100) gives more diverse answers, while a lower value (e.g., 10) is more conservative.
  • Repeat Penalty: 1.1 – Penalizes repeated phrases to reduce repetition; 1.0 means no penalty.
  • GPU Layers: 49 (14B) / 65 (32B) – The number of transformer layers offloaded to the GPU. 49 and 65 are the respective maximums, i.e., full GPU acceleration; even with fewer layers offloaded, the wait is tolerable when output is streamed.
  • Note: Experiment with these options based on your hardware; they can also be overridden per request, as sketched below.
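
A minimal sketch of overriding these settings at request time via Ollama's native /api/generate endpoint (the options field accepts the same parameter names; num_gpu sets the GPU layer count; all values here are illustrative):

curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
        "model": "sugoi14b",
        "prompt": "Translate to English: おはようございます。",
        "stream": true,
        "options": {
          "temperature": 0.3,
          "num_ctx": 4096,
          "top_k": 40,
          "top_p": 0.9,
          "repeat_penalty": 1.1,
          "num_gpu": 49
        }
      }'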