# ollama-utils
Utilities for managing Ollama LLM models, including automated installation from HuggingFace.
## TODO
| Use Case | Best Model | VRAM | Speed | Why |
|---|---|---|---|---|
| IDE Autocomplete | Qwen2.5-Coder-1.5B (Q8) | 2.5GB | 120-150 t/s | Latency critical, FIM optimized |
| Quick Drafting | Yi-Coder-9B (Q5_K_M) | 7-8GB | 50-80 t/s | Best speed/quality balance |
| Large Code Analysis | Qwen2.5-Coder-14B (Q4_K_M) | 14-16GB | 30-40 t/s | SOTA repo-level, 128K context |
| Reverse Engineering | DeepCoder-14B (Q5_K_M) | 11-12GB | 30-50 t/s | Strongest reasoning, RL-trained |
- `gemma3-12b-it-qat`
- `gemma3-4b-it-qat`
## Web Interface
Start the web interface:
```bash
./start-web.sh
```
Then open http://localhost:5000
The web interface provides:
- Model Management: View, install, and delete Ollama models
- Import from HuggingFace: Download GGUF models with automatic Modelfile generation (includes capabilities support)
- Performance Assessment: Test VRAM usage and optimize context size
- Real-time Monitoring: System resources, GPU metrics, and running models
**Capabilities Support in Web UI:** When importing models from HuggingFace, the generated Modelfile includes a `# capabilities: tools` comment for documentation. This helps track which models support features like tool calling or vision. Note that capabilities are automatically detected by Ollama from the GGUF file metadata.
## Components
### HuggingFace LLM Installer (`hf-llm-install.py`)
Automatically downloads GGUF files from HuggingFace and creates Ollama models based on Modelfile configurations.
#### Features
- Downloads GGUF models directly from HuggingFace repositories
- SHA256 checksum verification for downloaded files (see the sketch after this list)
- Disk space checking before downloads
- Dry run mode to preview installations
- Parallel processing for faster batch installations
- Skip existing models to avoid reinstalling
- Automatically creates Ollama models from Modelfiles
- Supports batch processing of multiple models
- Cleans up temporary files after installation
- Progress indication during downloads
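
As a rough sketch of what the checksum step involves (not the script's exact code), a downloaded file can be streamed through `hashlib` and compared against the `sha256:` header value:

```python
import hashlib

def verify_sha256(path, expected, chunk_size=1 << 20):
    """Stream the file through SHA-256 and compare with the expected hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

Streaming in chunks keeps memory flat even for multi-gigabyte GGUF files.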
#### Usage
```bash
# Install all models from the default modelfile-repo directory
python3 scripts/hf-llm-install.py

# Install a specific model
python3 scripts/hf-llm-install.py modelfile-repo/ministral-3-8b-instruct-2512-q5_k_m.Modelfile

# Dry run to see what would be installed
python3 scripts/hf-llm-install.py --dry-run

# Skip models that already exist in Ollama
python3 scripts/hf-llm-install.py --skip-existing

# Install with 3 parallel downloads
python3 scripts/hf-llm-install.py --parallel 3

# Install all models from a custom directory
python3 scripts/hf-llm-install.py --dir /path/to/modelfiles
```
#### Modelfile Format
Each `.Modelfile` must include a special header with the HuggingFace upstream URL.

**Option 1: Specify exact GGUF file:**

```
# ollama-utils-metadata
# hf_upstream: https://huggingface.co/org/repo/blob/main/model.gguf
# sha256: abc123... (optional - for checksum verification)

FROM ./model.gguf
# ... rest of your Modelfile configuration
```
**Option 2: Specify repo and quantization:**

```
# ollama-utils-metadata
# hf_upstream: https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512-GGUF
# quantization: q5_k_m (optional - default: q4_k_m)
# capabilities: tools (optional - comma-separated list: tools, vision)
# sha256: abc123... (optional - for checksum verification)

FROM ./model.gguf
# ... rest of your Modelfile configuration
```
When using Option 2, the script automatically constructs the GGUF filename from the repo name and quantization level, as sketched below.
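
As an illustration only (the script's actual rules may differ, e.g. for multi-part GGUFs), the construction might look like:

```python
def gguf_filename(hf_upstream, quantization="q4_k_m"):
    """Derive e.g. 'Ministral-3-3B-Instruct-2512-Q5_K_M.gguf' from a repo URL."""
    repo = hf_upstream.rstrip("/").split("/")[-1]  # org/repo -> repo name
    base = repo[:-len("-GGUF")] if repo.endswith("-GGUF") else repo
    return "{}-{}.gguf".format(base, quantization.upper())

print(gguf_filename("https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512-GGUF",
                    "q5_k_m"))
# Ministral-3-3B-Instruct-2512-Q5_K_M.gguf
```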
**Capabilities Support:**

The `capabilities` metadata field allows you to document which capabilities your model supports (like tool calling or vision). While Ollama automatically detects capabilities from the GGUF file metadata, adding this comment helps with documentation and tracking which models have which features.
Common capabilities:

- `tools` - for models that support function/tool calling (e.g., Qwen2.5, Mistral)
- `vision` - for models that support image input (multimodal models)
Example:

```
# capabilities: tools
# capabilities: tools, vision
```
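
A tool consuming these Modelfiles might read the comment like this (a minimal sketch; `parse_capabilities` is a hypothetical helper, not part of the script's documented API):

```python
def parse_capabilities(modelfile_text):
    """Return the comma-separated values of a '# capabilities:' comment, if any."""
    for line in modelfile_text.splitlines():
        if line.strip().lower().startswith("# capabilities:"):
            return [c.strip() for c in line.split(":", 1)[1].split(",")]
    return []

print(parse_capabilities("# capabilities: tools, vision\nFROM ./model.gguf"))
# ['tools', 'vision']
```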
**Vision Model Support (MMProj):**
For vision-capable models, you can specify an mmproj (multimodal projection) file that contains the vision encoder. See MMProj Support Documentation for detailed information.
```
# hf_upstream: https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512-GGUF
# quantization: Q5_K_M
# capabilities: vision, reasoning, tools
#
# mmproj_url: https://huggingface.co/unsloth/Ministral-3-14B-Reasoning-2512-GGUF
# mmproj_quant: BF16
# mmproj_sha256: abc123... (optional)
```
The script will automatically download both the main GGUF and mmproj files, and create an Ollama model with vision support.
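
The generated Modelfile itself isn't shown in this README, but a plausible shape (assuming Ollama's convention, used in community llava-style imports, of listing the projector as a second `FROM` line; the filenames here are hypothetical) would be:

```
FROM ./Ministral-3-14B-Reasoning-2512-Q5_K_M.gguf
FROM ./mmproj-BF16.gguf
```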
**Note:** Capabilities are read from the GGUF file's metadata by Ollama. The `# capabilities:` comment serves as documentation to track expected model features. If a model doesn't show the expected capabilities after installation, it may be due to the GGUF file lacking that metadata.
The script will:
- Parse the `hf_upstream` URL from the Modelfile header
- Convert blob URLs to resolve URLs for direct download
- Download the GGUF file to a temporary location
- Create the Ollama model (name derived from the HuggingFace repo name)
- Clean up the temporary GGUF file
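
A minimal sketch of the URL conversion and download steps (standard library only; the real script adds progress reporting, disk-space checks, and cleanup):

```python
import shutil
import urllib.request

def to_resolve_url(blob_url):
    """HF 'blob' URLs serve an HTML viewer; 'resolve' URLs serve the raw file."""
    return blob_url.replace("/blob/", "/resolve/", 1)

def download_gguf(blob_url, dest_path):
    """Download the raw GGUF file to a local path."""
    with urllib.request.urlopen(to_resolve_url(blob_url)) as resp, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
```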
#### Installation Requirements
- Python 3.6+
- Ollama installed and available in PATH
- Internet connection for downloading models
## Plugins
### Reranking Endpoint (`plugins/reranking-endpoint/`)
A FastAPI service that provides document reranking using cross-encoder models (BGE-reranker, Qwen3-Reranker, etc.) via Ollama.
⚠️ **Limitation:** This is a workaround that uses embedding magnitudes instead of the proper classification head. Ollama doesn't expose an `/api/rerank` endpoint or the classification layer that cross-encoder models are designed to use. It is less accurate than sentence-transformers but integrates with Ollama's GPU scheduling. See `plugins/reranking-endpoint/README.md` for detailed limitations.
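
For orientation, a bare-bones embedding-based scorer against Ollama's `/api/embeddings` endpoint might look like the following (a sketch using cosine similarity; the plugin's actual scoring, batching, and error handling differ, and the model name is illustrative):

```python
import math
import requests  # assumption: available in the environment

OLLAMA_EMBED = "http://localhost:11434/api/embeddings"

def embed(model, text):
    resp = requests.post(OLLAMA_EMBED, json={"model": model, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

def rerank(model, query, docs):
    """Rank docs by cosine similarity between query and document embeddings."""
    q = embed(model, query)
    q_norm = math.sqrt(sum(x * x for x in q)) or 1.0
    scored = []
    for doc in docs:
        d = embed(model, doc)
        d_norm = math.sqrt(sum(x * x for x in d)) or 1.0
        cosine = sum(a * b for a, b in zip(q, d)) / (q_norm * d_norm)
        scored.append((cosine, doc))
    return sorted(scored, reverse=True)

# Example: rerank("bge-m3", "what is VRAM?", ["GPU memory ...", "CPU cache ..."])
```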
## Other Scripts
- `context-optimizer.py` - Find the optimal `num_ctx` for models based on VRAM constraints (see the sketch below)
- `vram-test.py` - Test models for VRAM usage and CPU offloading
- `model-info.py` - Display information about Ollama models
- `ollama-monitor.py` - Monitor Ollama service status
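
To give a flavor of what `context-optimizer.py` does, `num_ctx` can be set per request through the API's `options` field, and `ollama ps` reports whether the loaded model is fully on the GPU (a sketch; the script's actual search strategy and parsing are more involved):

```python
import subprocess
import requests  # assumption: available in the environment

def load_with_ctx(model, num_ctx):
    """Load (or reload) the model with the requested context size."""
    requests.post("http://localhost:11434/api/generate",
                  json={"model": model, "prompt": "ping",
                        "options": {"num_ctx": num_ctx}, "stream": False},
                  timeout=600).raise_for_status()

def fully_on_gpu():
    """'ollama ps' shows a PROCESSOR column such as '100% GPU' or '42%/58% CPU/GPU'."""
    out = subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout
    return "100% GPU" in out

for ctx in (4096, 8192, 16384, 32768):
    load_with_ctx("qwen2.5-coder:14b", ctx)  # example model name
    print(ctx, "fits in VRAM" if fully_on_gpu() else "spills to CPU")
```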
## Repository Structure
```
ollama-utils/
├── web_app.py              # Web interface (Flask application)
├── requirements.txt        # Python dependencies
├── templates/              # HTML templates
│   └── index.html
├── static/                 # CSS and JavaScript
│   ├── style.css
│   └── script.js
├── scripts/                # Utility scripts
│   ├── hf-llm-install.py
│   ├── context-optimizer.py
│   ├── vram-test.py
│   ├── model-info.py
│   └── ollama-monitor.py
├── modelfile-repo/         # Collection of pre-configured Modelfiles
│   └── *.Modelfile
└── README.md
```
## Requirements
- Python 3.6+
- Flask 3.0.0+
- Ollama installed and running
- AMD GPU with sysfs metrics (for GPU monitoring) or NVIDIA GPU
## License
See LICENSE file for details.