initial commit

2026-01-18 22:01:50 +01:00
parent ab25613358
commit c40874d7f0
20 changed files with 6287 additions and 11 deletions

1
.gitignore vendored

@@ -174,3 +174,4 @@ cython_debug/
# PyPI configuration file
.pypirc
.venv/

20
LICENSE

@@ -1,18 +1,18 @@
# MIT License
Copyright (c) 2026 mstoeck3
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
associated documentation files (the "Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the
following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial
portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT
LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO
EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
USE OR OTHER DEALINGS IN THE SOFTWARE.

163
README.md

@@ -1,3 +1,164 @@
# ollama-utils
ollama management utilities
Utilities for managing Ollama LLM models, including automated installation from HuggingFace.
## Web Interface
Start the web interface:
```bash
./start-web.sh
```
Then open http://localhost:5000
The web interface provides:
- **Model Management**: View, install, and delete Ollama models
- **Import from HuggingFace**: Download GGUF models with automatic Modelfile generation (includes capabilities support)
- **Performance Assessment**: Test VRAM usage and optimize context size
- **Real-time Monitoring**: System resources, GPU metrics, and running models
**Capabilities Support in Web UI**: When importing models from HuggingFace, the generated Modelfile includes a `# capabilities: tools` comment for documentation. This helps track which models support features like tool calling or vision. Note that capabilities are automatically detected by Ollama from the GGUF file metadata.
## Components
### HuggingFace LLM Installer (`hf-llm-install.py`)
Automatically downloads GGUF files from HuggingFace and creates Ollama models based on Modelfile configurations.
#### Features
- Downloads GGUF models directly from HuggingFace repositories
- **SHA256 checksum verification** for downloaded files
- **Disk space checking** before downloads
- **Dry run mode** to preview installations
- **Parallel processing** for faster batch installations
- **Skip existing models** to avoid reinstalling
- Automatically creates Ollama models from Modelfiles
- Supports batch processing of multiple models
- Cleans up temporary files after installation
- Progress indication during downloads
#### Usage
```bash
# Install all models from the default modelfile-repo directory
python3 scripts/hf-llm-install.py
# Install a specific model
python3 scripts/hf-llm-install.py modelfile-repo/ministral-3-8b-instruct-2512-q5_k_m.Modelfile
# Dry run to see what would be installed
python3 scripts/hf-llm-install.py --dry-run
# Skip models that already exist in Ollama
python3 scripts/hf-llm-install.py --skip-existing
# Install with 3 parallel downloads
python3 scripts/hf-llm-install.py --parallel 3
# Install all models from a custom directory
python3 scripts/hf-llm-install.py --dir /path/to/modelfiles
```
#### Modelfile Format
Each `.Modelfile` must include a special header with the HuggingFace upstream URL.
**Option 1: Specify exact GGUF file:**
```dockerfile
# ollama-utils-metadata
# hf_upstream: https://huggingface.co/org/repo/blob/main/model.gguf
# sha256: abc123... (optional - for checksum verification)
FROM ./model.gguf
# ... rest of your Modelfile configuration
```
**Option 2: Specify repo and quantization:**
```dockerfile
# ollama-utils-metadata
# hf_upstream: https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512-GGUF
# quantization: q5_k_m (optional - default: q4_k_m)
# capabilities: tools (optional - comma-separated list: tools, vision)
# sha256: abc123... (optional - for checksum verification)
FROM ./model.gguf
# ... rest of your Modelfile configuration
```
When using Option 2, the script automatically constructs the GGUF filename from the repo name and quantization level.
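For reference, the sketch below mirrors that construction as implemented in `scripts/hf-llm-install.py` (the helper name `build_gguf_url` is illustrative, not part of the script):
```python
from urllib.parse import urlparse

def build_gguf_url(hf_upstream, quantization="q4_k_m"):
    """Sketch of the Option 2 filename/URL construction."""
    repo_name = urlparse(hf_upstream).path.strip("/").split("/")[1]
    if repo_name.upper().endswith("-GGUF"):   # drop the -GGUF suffix if present
        repo_name = repo_name[:-5]
    gguf_filename = f"{repo_name}-{quantization.upper()}.gguf"
    resolve_url = f"{hf_upstream.rstrip('/')}/resolve/main/{gguf_filename}"
    return gguf_filename, resolve_url

# https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512-GGUF + q5_k_m
# -> Ministral-3-3B-Instruct-2512-Q5_K_M.gguf
```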
**Capabilities Support:**
The `capabilities` metadata field allows you to document which capabilities your model supports (like tool calling or vision). While Ollama automatically detects capabilities from the GGUF file metadata, adding this comment helps with documentation and tracking which models have which features.
Common capabilities:
- `tools` - For models that support function/tool calling (e.g., Qwen2.5, Mistral)
- `vision` - For models that support image input (multimodal models)
Example:
```dockerfile
# capabilities: tools
# capabilities: tools, vision
```
**Note:** Capabilities are read from the GGUF file's metadata by Ollama. The `# capabilities:` comment serves as documentation to track expected model features. If a model doesn't show the expected capabilities after installation, it may be due to the GGUF file lacking that metadata.
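To confirm what Ollama actually detected after installation, check the `Capabilities` section of `ollama show <model>`; this is the same approach `scripts/model-info.py` takes. A minimal sketch (the model name is a placeholder):
```python
import subprocess

def detected_capabilities(model):
    """List the capability names reported by `ollama show <model>`."""
    out = subprocess.run(["ollama", "show", model],
                         capture_output=True, text=True, check=True).stdout
    caps, in_section = [], False
    for line in out.splitlines():
        line = line.strip()
        if line in ("Model", "Capabilities", "Parameters", "System", "License"):
            in_section = (line == "Capabilities")   # section headers delimit the output
            continue
        if in_section and line:
            caps.append(line.lower())
    return caps

print(detected_capabilities("ministral-3:3b-instruct-2512-q5_k_m"))  # e.g. ['completion', 'tools']
```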
The script will:
1. Parse the `hf_upstream` URL from the Modelfile header
2. Convert blob URLs to resolve URLs for direct download
3. Download the GGUF file to a temporary location
4. Create the Ollama model (name derived from the GGUF filename; see the naming sketch below)
5. Clean up the temporary GGUF file
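The naming convention in step 4 follows `parse_model_name_from_gguf` in `scripts/hf-llm-install.py`: everything before the parameter-size token becomes the model base, and the size token plus what follows (variant, date, quantization) becomes the tag. A condensed sketch (the function name is illustrative):
```python
import re

def ollama_name_from_gguf(gguf_filename):
    """Ministral-3-3B-Instruct-2512-Q5_K_M.gguf -> ministral-3:3b-instruct-2512-q5_k_m"""
    parts = gguf_filename.rsplit(".gguf", 1)[0].lower().split("-")
    for i, part in enumerate(parts):
        if i and re.match(r"^\d+(\.\d+)?b$", part):   # size token such as "3b" or "0.5b"
            return "-".join(parts[:i]) + ":" + "-".join(parts[i:])
    return "-".join(parts)  # fallback: keep the filename stem as the model name

print(ollama_name_from_gguf("Ministral-3-3B-Instruct-2512-Q5_K_M.gguf"))
# ministral-3:3b-instruct-2512-q5_k_m
```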
#### Installation Requirements
- Python 3.6+
- Ollama installed and available in PATH
- Internet connection for downloading models
### Other Scripts
- `context-optimizer.py` - Find optimal num_ctx for models based on VRAM constraints
- `vram-test.py` - Test models for VRAM usage and CPU offloading
- `model-info.py` - Display information about Ollama models
- `ollama-monitor.py` - Monitor Ollama service status
## Repository Structure
```text
ollama-utils/
├── web_app.py # Web interface (Flask application)
├── requirements.txt # Python dependencies
├── templates/ # HTML templates
│ └── index.html
├── static/ # CSS and JavaScript
│ ├── style.css
│ └── script.js
├── scripts/ # Utility scripts
│ ├── hf-llm-install.py
│ ├── context-optimizer.py
│ ├── vram-test.py
│ ├── model-info.py
│ └── ollama-monitor.py
├── modelfile-repo/ # Collection of pre-configured Modelfiles
│ └── *.Modelfile
└── README.md
```
## Requirements
- Python 3.6+
- Flask 3.0.0+
- Ollama installed and running
- AMD GPU with sysfs metrics (for GPU monitoring) or NVIDIA GPU
## License
See [LICENSE](LICENSE) file for details.


@@ -0,0 +1,20 @@
# ollama-utils-metadata
# hf_upstream: https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512-GGUF/blob/main/Ministral-3-3B-Instruct-2512-Q5_K_M.gguf
# capabilities: tools,vision
FROM ./Ministral-3-3B-Instruct-2512-Q5_K_M.gguf
# Specialized parser for Mistral 3 logic
PARSER ministral
# Full context and recommended stability parameters
PARAMETER num_ctx 32768
PARAMETER temperature 0.15
PARAMETER top_p 1.0
PARAMETER stop "</s>"
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
# Template: Basic structure that lets the PARSER handle the heavy lifting
TEMPLATE """{{ if .System }}[SYSTEM_PROMPT]{{ .System }}[/SYSTEM_PROMPT]{{ end }}[INST]{{ .Prompt }}[/INST]"""
SYSTEM """You are a helpful and concise AI assistant."""


@@ -0,0 +1,26 @@
# ollama-utils-metadata
# hf_upstream: https://huggingface.co/mistralai/Ministral-3-3B-Reasoning-2512-gguf
# quantization: q5_k_m
# capabilities: tools,vision,thinking
FROM ./Ministral-3-3B-Reasoning-2512-Q5_K_M.gguf
# Specialized parser
PARSER ministral
# Parameters tuned for logic/reasoning
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
PARAMETER top_p 0.95
PARAMETER stop "</s>"
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
PARAMETER stop "[/THINK]"
# Template
TEMPLATE """{{ if .System }}[SYSTEM_PROMPT]{{ .System }}[/SYSTEM_PROMPT]{{ end }}[INST]{{ .Prompt }}[/INST]"""
SYSTEM """# HOW YOU SHOULD THINK AND ANSWER
First draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.
Your thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response to the user.[/THINK]Here, provide a self-contained response."""

1078
modelfile-repo/script.js Normal file

File diff suppressed because it is too large

1
requirements.txt Normal file

@@ -0,0 +1 @@
Flask==3.1.2

557
scripts/context-optimizer.py Executable file

@@ -0,0 +1,557 @@
#!/usr/bin/env python3
"""
Ollama Context Optimizer - Find optimal num_ctx for models based on VRAM
Iteratively tests different context sizes to recommend the best setting.
"""
import argparse
import json
import subprocess
import sys
import time
import urllib.request
import urllib.error
from typing import Dict, Optional, Tuple
def get_ollama_url():
"""Get the Ollama API URL."""
return "http://localhost:11434"
def get_gpu_vram() -> Tuple[Optional[float], Optional[float]]:
"""Get GPU VRAM total and available in GB."""
import os
device_paths = [
"/sys/class/drm/card1/device/",
"/sys/class/drm/card0/device/",
]
for base_path in device_paths:
if not os.path.exists(base_path):
continue
try:
with open(base_path + "mem_info_vram_used", "r") as f:
used = int(f.read().strip()) / 1024 / 1024 / 1024 # GB
with open(base_path + "mem_info_vram_total", "r") as f:
total = int(f.read().strip()) / 1024 / 1024 / 1024 # GB
available = total - used
return total, available
except:
continue
return None, None
def get_model_info(model_name: str) -> Dict:
"""Get model information including max context capability."""
try:
result = subprocess.run(
['ollama', 'show', model_name],
capture_output=True,
text=True,
check=True
)
info = {
'max_context': 0,
'current_num_ctx': 0,
'params': 'N/A',
'quant': 'N/A'
}
current_section = None
for line in result.stdout.split('\n'):
line = line.strip()
if not line:
continue
if line in ["Model", "Parameters"]:
current_section = line
continue
if current_section == "Model":
parts = line.split(maxsplit=1)
if len(parts) == 2:
k, v = parts[0].lower(), parts[1].strip()
if 'context' in k and 'length' in k:
if v.isdigit():
info['max_context'] = int(v)
elif 'context' in k:
# Handle "context length" as two words
parts2 = line.split()
if len(parts2) >= 3 and parts2[-1].isdigit():
info['max_context'] = int(parts2[-1])
elif 'quantization' in k:
info['quant'] = v
elif 'parameters' in k:
info['params'] = v
elif current_section == "Parameters":
if 'num_ctx' in line.lower():
parts = line.split(maxsplit=1)
if len(parts) == 2 and parts[1].strip().isdigit():
info['current_num_ctx'] = int(parts[1].strip())
return info
except subprocess.CalledProcessError:
return {'max_context': 0, 'current_num_ctx': 0, 'params': 'N/A', 'quant': 'N/A'}
def test_context_size(model_name: str, num_ctx: int) -> Optional[Dict]:
"""
Test a model with a specific context size.
Returns VRAM usage and offload info, or None if failed.
"""
url = f"{get_ollama_url()}/api/generate"
prompt_data = {
"model": model_name,
"prompt": "Reply with only: OK",
"stream": False,
"options": {
"num_ctx": num_ctx
}
}
try:
req = urllib.request.Request(
url,
data=json.dumps(prompt_data).encode('utf-8'),
headers={'Content-Type': 'application/json'}
)
# Send request with longer timeout for large contexts
# Large contexts can take time to allocate
timeout = 60 if num_ctx > 100000 else 30
with urllib.request.urlopen(req, timeout=timeout) as response:
response_data = response.read().decode()
# Check if response contains error
try:
resp_json = json.loads(response_data)
if 'error' in resp_json:
error_msg = resp_json['error']
# Return special dict to indicate OOM or other errors
return {
'vram_gb': 0,
'total_gb': 0,
'offload_pct': 0,
'num_ctx': num_ctx,
'error': error_msg
}
except:
pass
# Wait for model to stabilize
time.sleep(0.5)
# Check /api/ps for VRAM usage
ps_url = f"{get_ollama_url()}/api/ps"
with urllib.request.urlopen(ps_url, timeout=5) as r:
ps_data = json.loads(r.read().decode())
models = ps_data.get('models', [])
for m in models:
if m['name'] == model_name or m['name'].startswith(model_name + ':'):
size_bytes = m.get('size', 0)
size_vram = m.get('size_vram', 0)
vram_gb = size_vram / (1024**3)
total_gb = size_bytes / (1024**3)
offload_pct = 0
if size_bytes > 0:
offload_pct = ((size_bytes - size_vram) / size_bytes) * 100
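# e.g. a 10 GB model with only 8 GB resident in VRAM -> 20% offloaded to CPU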
return {
'vram_gb': vram_gb,
'total_gb': total_gb,
'offload_pct': offload_pct,
'num_ctx': num_ctx
}
return None
except urllib.error.HTTPError as e:
# HTTP errors (500, etc.) - often indicates OOM or model loading failure
try:
error_body = e.read().decode()
error_data = json.loads(error_body)
error_msg = error_data.get('error', str(e))
except:
error_msg = f"HTTP {e.code}"
return {
'vram_gb': 0,
'total_gb': 0,
'offload_pct': 0,
'num_ctx': num_ctx,
'error': error_msg
}
except urllib.error.URLError as e:
# Network/timeout errors
if 'timed out' in str(e).lower():
error_msg = "Timeout (loading too slow)"
else:
error_msg = f"Connection error: {e}"
return {
'vram_gb': 0,
'total_gb': 0,
'offload_pct': 0,
'num_ctx': num_ctx,
'error': error_msg
}
except Exception as e:
# Other unexpected errors
return {
'vram_gb': 0,
'total_gb': 0,
'offload_pct': 0,
'num_ctx': num_ctx,
'error': str(e)[:50]
}
def find_optimal_context(model_name: str, max_turns: Optional[int], overhead_gb: float) -> Dict:
"""
Find optimal context size through intelligent testing.
Uses VRAM measurements to extrapolate optimal size.
Args:
model_name: Name of the Ollama model to test
max_turns: Maximum iterations (None = optimize until convergence)
overhead_gb: VRAM to keep free for system overhead
"""
print(f"Analyzing model: {model_name}")
print("-" * 70)
# Get model capabilities
info = get_model_info(model_name)
max_context = info['max_context']
current_ctx = info['current_num_ctx']
print(f"Model: {model_name}")
print(f"Parameters: {info['params']} ({info['quant']})")
print(f"Max context capability: {max_context:,}")
print(f"Current num_ctx: {current_ctx:,}")
if max_context == 0:
print("\n✗ Could not determine model's max context capability")
return {}
# Get VRAM info
vram_total, vram_available = get_gpu_vram()
if vram_total:
print(f"GPU VRAM: {vram_available:.1f} GB available / {vram_total:.1f} GB total")
print(f"Overhead reserved: {overhead_gb:.1f} GB")
# Reserve specified overhead
target_vram = vram_total - overhead_gb
else:
print("⚠ Could not detect GPU VRAM (testing will continue)")
target_vram = None
if max_turns:
print(f"Testing with max {max_turns} iterations...")
else:
print(f"Testing until convergence (num_ctx must be multiple of 2048)...")
print()
results = []
# Turn 1: Test current setting to establish baseline
test_ctx = current_ctx if current_ctx > 0 else 8192
turn_label = f"Turn 1/{max_turns}" if max_turns else "Turn 1"
print(f"{turn_label}: Testing num_ctx={test_ctx:,} (baseline)...", end=' ', flush=True)
result = test_context_size(model_name, test_ctx)
if result and 'error' not in result:
results.append(result)
print(f"✓ VRAM: {result['vram_gb']:.2f} GB, Offload: {result['offload_pct']:.1f}% CPU" if result['offload_pct'] > 0 else f"✓ VRAM: {result['vram_gb']:.2f} GB, Offload: GPU only")
baseline_vram = result['vram_gb']
baseline_ctx = test_ctx
else:
print("✗ Failed")
return {'model': model_name, 'results': results, 'max_context': max_context, 'current_ctx': current_ctx, 'vram_total': vram_total, 'info': info}
# Turn 2: Test a higher context to calculate VRAM/context ratio
# Try doubling the context or 32K, whichever is smaller
test_ctx_2 = min(baseline_ctx * 2, 32768, max_context)
if test_ctx_2 <= baseline_ctx:
test_ctx_2 = min(baseline_ctx + 16384, max_context)
# Round to multiple of 2048
test_ctx_2 = (test_ctx_2 // 2048) * 2048
turn_label = f"Turn 2/{max_turns}" if max_turns else "Turn 2"
print(f"{turn_label}: Testing num_ctx={test_ctx_2:,} (calibration)...", end=' ', flush=True)
result = test_context_size(model_name, test_ctx_2)
if result and 'error' not in result:
results.append(result)
print(f"✓ VRAM: {result['vram_gb']:.2f} GB, Offload: {result['offload_pct']:.1f}% CPU" if result['offload_pct'] > 0 else f"✓ VRAM: {result['vram_gb']:.2f} GB, Offload: GPU only")
# Calculate VRAM per 1K context tokens
vram_diff = result['vram_gb'] - baseline_vram
ctx_diff = test_ctx_2 - baseline_ctx
if ctx_diff > 0:
vram_per_1k_ctx = (vram_diff / ctx_diff) * 1000
print(f" → Estimated VRAM usage: {vram_per_1k_ctx:.4f} GB per 1K context")
# Predict optimal context size
if target_vram and vram_per_1k_ctx > 0:
available_for_ctx = target_vram - baseline_vram
estimated_additional_ctx = (available_for_ctx / vram_per_1k_ctx) * 1000
predicted_optimal = baseline_ctx + int(estimated_additional_ctx)
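# Illustration: 4 GB of headroom at 0.1 GB per 1K context predicts ~40,000 extra tokens (rounded to a 2048 multiple below)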
# Round to multiple of 2048
predicted_optimal = (predicted_optimal // 2048) * 2048
predicted_optimal = max(baseline_ctx, min(predicted_optimal, max_context))
print(f" → Predicted optimal context: {predicted_optimal:,}")
else:
predicted_optimal = None
vram_per_1k_ctx = None
else:
vram_per_1k_ctx = None
predicted_optimal = None
else:
if result and 'error' in result:
error_msg = result['error']
if 'memory' in error_msg.lower() or 'oom' in error_msg.lower():
print(f"✗ OOM (out of memory)")
else:
print(f"✗ Error: {error_msg[:30]}")
else:
print("✗ Failed")
vram_per_1k_ctx = None
predicted_optimal = None
# Remaining turns: Test predicted optimal or use VRAM-based refinement
min_ctx = baseline_ctx
max_ctx = max_context
turn = 2
while True:
# Check if we should stop
if max_turns and turn >= max_turns:
break
if predicted_optimal and turn == 2:
# Turn 3: Test predicted optimal
test_ctx = predicted_optimal
turn_label = f"Turn {turn + 1}/{max_turns}" if max_turns else f"Turn {turn + 1}"
print(f"{turn_label}: Testing num_ctx={test_ctx:,} (predicted optimal)...", end=' ', flush=True)
else:
# Use VRAM-based prediction if we have the data
if vram_per_1k_ctx and target_vram and len(results) > 0:
# Find the last successful result (no offload)
last_good = None
for r in reversed(results):
if r['offload_pct'] == 0:
last_good = r
break
if last_good and target_vram:
# Calculate how much more context we can realistically add
available_vram = target_vram - last_good['vram_gb']
# Calculate potential additional context
additional_ctx = (available_vram / vram_per_1k_ctx) * 1000
# If we can only add < 8K context, do small increments
if additional_ctx < 8192:
# Small increments - round up to next 2048 boundary
test_ctx = last_good['num_ctx'] + 2048
test_ctx = max(min_ctx + 2048, min(test_ctx, max_ctx))
else:
# Larger headroom - use 60% of predicted to be conservative
test_ctx = last_good['num_ctx'] + int(additional_ctx * 0.6)
test_ctx = (test_ctx // 2048) * 2048
test_ctx = max(min_ctx + 2048, min(test_ctx, max_ctx))
else:
# No good result yet - binary search
test_ctx = (min_ctx + max_ctx) // 2
test_ctx = (test_ctx // 2048) * 2048
else:
# No VRAM data - fall back to binary search
test_ctx = (min_ctx + max_ctx) // 2
test_ctx = (test_ctx // 2048) * 2048
# Avoid retesting same value
if any(r['num_ctx'] == test_ctx for r in results):
# Adjust by 2048
if test_ctx < max_ctx:
test_ctx += 2048
else:
test_ctx -= 2048
if test_ctx <= min_ctx or test_ctx >= max_ctx:
print(f"\nConverged after {turn + 1} turns")
break
turn_label = f"Turn {turn + 1}/{max_turns}" if max_turns else f"Turn {turn + 1}"
print(f"{turn_label}: Testing num_ctx={test_ctx:,}...", end=' ', flush=True)
result = test_context_size(model_name, test_ctx)
if result is None:
print("✗ Failed (model not found)")
max_ctx = test_ctx
continue
if 'error' in result:
error_msg = result['error']
if 'memory' in error_msg.lower() or 'oom' in error_msg.lower():
print(f"✗ OOM (out of memory)")
elif 'timeout' in error_msg.lower():
print(f"✗ Timeout")
else:
print(f"✗ Error: {error_msg[:30]}")
max_ctx = test_ctx
continue
results.append(result)
offload_str = f"{result['offload_pct']:.1f}% CPU" if result['offload_pct'] > 0 else "GPU only"
print(f"✓ VRAM: {result['vram_gb']:.2f} GB, Offload: {offload_str}")
# Adjust search bounds
if result['offload_pct'] > 0:
max_ctx = test_ctx
else:
min_ctx = test_ctx
# Stop if we're converging (within one step of 2048)
if max_ctx - min_ctx <= 2048:
print(f"\nConverged after {turn + 1} turns")
break
turn += 1
return {
'model': model_name,
'results': results,
'max_context': max_context,
'current_ctx': current_ctx,
'vram_total': vram_total,
'info': info
}
def print_recommendation(analysis: Dict):
"""Print optimization recommendations."""
if not analysis or not analysis.get('results'):
print("\n✗ No results to analyze")
return
results = analysis['results']
max_context = analysis['max_context']
current_ctx = analysis['current_ctx']
print("\n" + "="*70)
print("OPTIMIZATION RECOMMENDATION")
print("="*70)
# Find best context without offloading
no_offload = [r for r in results if r['offload_pct'] == 0]
if no_offload:
# Recommend highest context without offloading
best = max(no_offload, key=lambda x: x['num_ctx'])
print(f"\n✓ Recommended num_ctx: {best['num_ctx']:,}")
print(f" VRAM usage: {best['vram_gb']:.2f} GB")
print(f" Status: Fits entirely in GPU memory")
if best['num_ctx'] < max_context:
print(f"\n⚠ Note: Model supports up to {max_context:,} context")
print(f" but VRAM limits optimal usage to {best['num_ctx']:,}")
if current_ctx != best['num_ctx']:
print(f"\n📝 Suggested Modelfile change:")
print(f" Current: PARAMETER num_ctx {current_ctx}")
print(f" Optimal: PARAMETER num_ctx {best['num_ctx']}")
else:
# All tests had offloading
print("\n⚠ All tested configurations require CPU offloading")
# Find least offloading
least_offload = min(results, key=lambda x: x['offload_pct'])
print(f"\n Least offloading at num_ctx={least_offload['num_ctx']:,}")
print(f" CPU offload: {least_offload['offload_pct']:.1f}%")
print(f" VRAM usage: {least_offload['vram_gb']:.2f} GB")
print(f"\n💡 Recommendations:")
print(f" 1. Use lower quantization (Q4 instead of Q5/Q8)")
print(f" 2. Reduce num_ctx to {least_offload['num_ctx']:,} or lower")
print(f" 3. Consider a smaller model variant")
# VRAM efficiency
print(f"\n📊 Tested context sizes:")
for r in sorted(results, key=lambda x: x['num_ctx']):
status = "✓" if r['offload_pct'] == 0 else "⚠"
print(f" {status} {r['num_ctx']:>6,}: {r['vram_gb']:>5.2f} GB VRAM, "
f"{r['offload_pct']:>4.1f}% CPU offload")
def main():
parser = argparse.ArgumentParser(
description='Optimize Ollama model context size for VRAM constraints',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Optimize with 5 test iterations (default)
%(prog)s ministral-3:3b-instruct-2512-q5_k_m
# Use 10 iterations for more precise optimization
%(prog)s ministral-3:3b-instruct-2512-q5_k_m --turns 10
"""
)
parser.add_argument(
'model',
help='Model name to optimize'
)
parser.add_argument(
'--turns',
type=int,
default=None,
help='Maximum number of test iterations (default: optimize until convergence)'
)
parser.add_argument(
'--overhead',
type=float,
default=1.0,
help='VRAM overhead to keep free in GB (default: 1.0)'
)
args = parser.parse_args()
if args.turns is not None and args.turns < 2:
print("✗ Error: --turns must be at least 2")
sys.exit(1)
# Check if ollama is available
try:
subprocess.run(['ollama', '--version'], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
print("✗ Error: 'ollama' command not found. Please install Ollama first.")
sys.exit(1)
# Run optimization
analysis = find_optimal_context(args.model, args.turns, args.overhead)
# Print recommendations
print_recommendation(analysis)
if __name__ == '__main__':
main()

638
scripts/hf-llm-install.py Executable file

@@ -0,0 +1,638 @@
#!/usr/bin/env python3
"""
HuggingFace LLM Installer for Ollama
Automatically downloads GGUF files from HuggingFace and creates Ollama models.
Features:
- SHA256 checksum verification
- Disk space checking
- Dry run mode
- Parallel processing
- Skip existing models
"""
import argparse
import hashlib
import json
import os
import re
import shutil
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.parse import urlparse
import urllib.request
def parse_model_name_from_gguf(gguf_filename):
"""
Parse model name and tag from GGUF filename.
Args:
gguf_filename: Name of the GGUF file
Returns:
Tuple of (model_base, tag, full_name) or (filename, 'latest', filename) if parsing fails
"""
filename_stem = Path(gguf_filename).stem.lower()
# Split on hyphens
parts = filename_stem.split('-')
if len(parts) >= 3:
# Find where the size variant starts (e.g., "0.5b", "3b", "8b", "14b")
base_parts = []
tag_parts = []
found_variant = False
for part in parts:
# Check if this looks like a size variant (e.g., "3b", "8b", "0.5b")
if not found_variant and re.match(r'^\d+(\.\d+)?b$', part):
found_variant = True
tag_parts.append(part)
elif found_variant:
# Include everything after the variant (including quantization)
tag_parts.append(part)
else:
# Before the variant = base name
base_parts.append(part)
if base_parts and tag_parts:
model_base = '-'.join(base_parts)
model_tag = '-'.join(tag_parts)
full_name = f"{model_base}:{model_tag}"
return (model_base, model_tag, full_name)
# Fallback to filename without extension
return (filename_stem, 'latest', filename_stem)
def parse_modelfile(modelfile_path):
"""
Parse a Modelfile to extract HuggingFace upstream URL and model info.
Args:
modelfile_path: Path to the .Modelfile
Returns:
dict with model metadata or None if invalid
"""
with open(modelfile_path, 'r') as f:
content = f.read()
# Look for hf_upstream in the header comments
hf_match = re.search(r'#\s*hf_upstream:\s*(https://huggingface\.co/[^\s]+)', content)
if not hf_match:
return None
hf_url = hf_match.group(1)
# Look for optional quantization specification (default: q4_k_m)
quant_match = re.search(r'#\s*quantization:\s*([a-zA-Z0-9_]+)', content)
quantization = quant_match.group(1).upper() if quant_match else 'Q4_K_M'
# Look for optional SHA256 checksum
sha256_match = re.search(r'#\s*sha256:\s*([a-fA-F0-9]{64})', content)
sha256 = sha256_match.group(1) if sha256_match else None
# Look for optional capabilities (comma-separated list)
# Format: # capabilities: tools, vision
capabilities_match = re.search(r'#\s*capabilities:\s*([^\n]+)', content)
capabilities = None
if capabilities_match:
# Parse comma-separated capabilities and clean whitespace
caps_str = capabilities_match.group(1).strip()
capabilities = [cap.strip() for cap in caps_str.split(',') if cap.strip()]
# Check if URL points to a specific GGUF file or just the repo
if hf_url.endswith('.gguf') or '/blob/' in hf_url or '/resolve/' in hf_url:
# Specific file provided - use as-is
resolve_url = hf_url.replace('/blob/', '/resolve/')
gguf_filename = os.path.basename(urlparse(resolve_url).path)
else:
# Repository root provided - construct filename from repo name and quantization
# URL format: https://huggingface.co/{org}/{repo}
url_parts = urlparse(hf_url).path.strip('/').split('/')
if len(url_parts) >= 2:
repo_name = url_parts[1] # e.g., "Ministral-3-3B-Instruct-2512-GGUF"
# Remove -GGUF suffix if present (case-insensitive)
if repo_name.upper().endswith('-GGUF'):
repo_name = repo_name[:-5]
# Construct filename: RepoName-Quantization.gguf
gguf_filename = f"{repo_name}-{quantization}.gguf"
resolve_url = f"{hf_url.rstrip('/')}/resolve/main/{gguf_filename}"
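# e.g. Ministral-3-3B-Instruct-2512-GGUF + Q5_K_M -> .../resolve/main/Ministral-3-3B-Instruct-2512-Q5_K_M.gguf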
else:
print(f"✗ Invalid HuggingFace URL format: {hf_url}")
return None
# Extract model name and tag from the GGUF filename
# Format: Model-Version-Variant-Year-Quant.gguf -> model:version-variant-year-quant
# Example: Ministral-3-3B-Instruct-2512-Q5_K_M.gguf -> ministral-3:3b-instruct-2512-q5_k_m
model_base, model_tag, model_name = parse_model_name_from_gguf(gguf_filename)
return {
'hf_url': hf_url,
'resolve_url': resolve_url,
'gguf_filename': gguf_filename,
'model_name': model_name,
'modelfile_path': modelfile_path,
'sha256': sha256,
'capabilities': capabilities
}
def get_file_size(url):
"""
Get the size of a file from URL without downloading it.
Args:
url: File URL
Returns:
Size in bytes or None if unavailable
"""
try:
req = urllib.request.Request(url, method='HEAD')
with urllib.request.urlopen(req, timeout=10) as response:
size = response.headers.get('Content-Length')
return int(size) if size else None
except Exception:
return None
def check_disk_space(required_bytes, path='.'):
"""
Check if there's enough disk space available.
Args:
required_bytes: Required space in bytes
path: Path to check space on (default: current directory)
Returns:
Tuple of (has_space, available_bytes, required_bytes)
"""
# Get absolute path to check actual filesystem
abs_path = os.path.abspath(path)
stat = shutil.disk_usage(abs_path)
# Add 10% safety margin
required_with_margin = int(required_bytes * 1.1)
return (stat.free >= required_with_margin, stat.free, required_with_margin)
def calculate_sha256(filepath, chunk_size=8192):
"""
Calculate SHA256 checksum of a file.
Args:
filepath: Path to file
chunk_size: Bytes to read at once
Returns:
SHA256 hex digest
"""
sha256_hash = hashlib.sha256()
with open(filepath, 'rb') as f:
for chunk in iter(lambda: f.read(chunk_size), b''):
sha256_hash.update(chunk)
return sha256_hash.hexdigest()
def verify_checksum(filepath, expected_sha256):
"""
Verify file checksum matches expected value.
Args:
filepath: Path to file
expected_sha256: Expected SHA256 hash
Returns:
True if match, False otherwise
"""
print(f" Verifying checksum...")
actual = calculate_sha256(filepath)
if actual.lower() == expected_sha256.lower():
print(f" ✓ Checksum verified: {actual[:16]}...")
return True
else:
print(f" ✗ Checksum mismatch!")
print(f" Expected: {expected_sha256}")
print(f" Actual: {actual}")
return False
def get_existing_models():
"""
Get list of existing Ollama models.
Returns:
Set of model names
"""
try:
result = subprocess.run(
['ollama', 'list'],
capture_output=True,
text=True,
check=True
)
# Parse output to get model names
# Format: NAME ID SIZE MODIFIED
models = set()
for line in result.stdout.strip().split('\n')[1:]: # Skip header
if line.strip():
# Get first column (name)
name = line.split()[0]
# Remove tag if present
base_name = name.split(':')[0]
models.add(base_name)
return models
except (subprocess.CalledProcessError, FileNotFoundError):
return set()
def download_file(url, dest_path, filename, should_cancel=None, progress_callback=None):
"""
Download a file from URL to destination with progress indication.
Args:
url: Source URL
dest_path: Destination file path
filename: Name for display purposes
should_cancel: Optional callback function that returns True if download should be cancelled
progress_callback: Optional callback function to report progress messages
"""
def log(msg):
"""Helper to print and optionally call progress callback."""
print(msg)
if progress_callback:
progress_callback(msg)
log(f"Downloading {filename}...")
log(f" From: {url}")
log(f" To: {dest_path}")
def show_progress(block_num, block_size, total_size):
# Check for cancellation
if should_cancel and should_cancel():
raise InterruptedError("Download cancelled")
downloaded = block_num * block_size
if total_size > 0:
percent = min(100, downloaded * 100 / total_size)
mb_downloaded = downloaded / (1024 * 1024)
mb_total = total_size / (1024 * 1024)
msg = f"\r Progress: {percent:.1f}% ({mb_downloaded:.1f}/{mb_total:.1f} MB)"
print(msg, end='')
if progress_callback:
progress_callback(f"Progress: {percent:.1f}% ({mb_downloaded:.1f}/{mb_total:.1f} MB)")
try:
urllib.request.urlretrieve(url, dest_path, show_progress)
print() # New line after progress
log(f"✓ Download complete")
except Exception as e:
print(f"\n✗ Download failed: {e}")
if progress_callback:
progress_callback(f"✗ Download failed: {e}")
raise
def create_ollama_model(modelfile_path, gguf_path, model_name, capabilities=None):
"""
Create an Ollama model from the Modelfile and GGUF file.
Args:
modelfile_path: Path to the .Modelfile
gguf_path: Path to the downloaded GGUF file
model_name: Name for the Ollama model
capabilities: Optional list of capabilities to add (e.g., ['tools', 'vision'])
"""
print(f"\nCreating Ollama model: {model_name}")
# Note: Capabilities are detected from the GGUF file metadata by Ollama automatically
if capabilities:
print(f" Expected capabilities from GGUF metadata: {', '.join(capabilities)}")
# Read the Modelfile and update the FROM path to point to the downloaded GGUF
with open(modelfile_path, 'r') as f:
modelfile_content = f.read()
# Replace the FROM line to use the actual GGUF path
# Handle both relative paths like "./filename.gguf" and URLs like "https://..."
original_content = modelfile_content
modelfile_content = re.sub(
r'FROM\s+(?:\./[^\s]+\.gguf|https?://[^\n]+)',
f'FROM {gguf_path}',
modelfile_content
)
# Debug: check if replacement happened
if original_content == modelfile_content:
print(f" WARNING: FROM line was not replaced!")
print(f" Looking for pattern in: {original_content[:200]}")
else:
print(f" ✓ Replaced FROM line with local path: {gguf_path}")
# Create a temporary Modelfile with the correct path
with tempfile.NamedTemporaryFile(mode='w', suffix='.Modelfile', delete=False) as tmp_modelfile:
tmp_modelfile.write(modelfile_content)
tmp_modelfile_path = tmp_modelfile.name
try:
# Run ollama create
cmd = ['ollama', 'create', model_name, '-f', tmp_modelfile_path]
print(f" Running: {' '.join(cmd)}")
result = subprocess.run(
cmd,
capture_output=True,
text=True
)
if result.returncode == 0:
print(f"✓ Model '{model_name}' created successfully")
if result.stdout:
print(f" {result.stdout.strip()}")
else:
print(f"✗ Failed to create model")
if result.stderr:
print(f" Error: {result.stderr.strip()}")
raise subprocess.CalledProcessError(result.returncode, cmd)
finally:
# Clean up temporary Modelfile
os.unlink(tmp_modelfile_path)
def install_model(modelfile_path, dry_run=False, skip_existing=False, existing_models=None, should_cancel=None, progress_callback=None):
"""
Install a single model from a Modelfile.
Args:
modelfile_path: Path to the .Modelfile
dry_run: If True, only simulate installation
skip_existing: If True, skip models already in Ollama
existing_models: Set of existing model names
should_cancel: Optional callback function that returns True if installation should be cancelled
progress_callback: Optional callback function to report progress messages
Returns:
Tuple of (success: bool, skipped: bool, model_name: str)
"""
def log(msg):
"""Helper to print and optionally call progress callback."""
print(msg)
if progress_callback:
progress_callback(msg)
log(f"\n{'='*80}")
log(f"Processing: {modelfile_path}")
log(f"{'='*80}")
# Parse the Modelfile
model_info = parse_modelfile(modelfile_path)
if not model_info:
log(f"✗ No hf_upstream found in {modelfile_path}")
return (False, False, None)
log(f"Model name: {model_info['model_name']}")
log(f"GGUF file: {model_info['gguf_filename']}")
if model_info['sha256']:
log(f"SHA256: {model_info['sha256'][:16]}...")
if model_info.get('capabilities'):
log(f"Capabilities: {', '.join(model_info['capabilities'])}")
# Check if model already exists
if skip_existing and existing_models and model_info['model_name'] in existing_models:
log(f"⊘ Model '{model_info['model_name']}' already exists, skipping")
return (True, True, model_info['model_name'])
# Get file size and check disk space
file_size = get_file_size(model_info['resolve_url'])
if file_size:
size_gb = file_size / (1024**3)
log(f"File size: {size_gb:.2f} GB")
if not dry_run:
has_space, available, required = check_disk_space(file_size)
if not has_space:
log(f"✗ Insufficient disk space!")
log(f" Required: {required / (1024**3):.2f} GB (with 10% margin)")
log(f" Available: {available / (1024**3):.2f} GB")
return (False, False, model_info['model_name'])
else:
log(f"✓ Disk space check passed ({available / (1024**3):.2f} GB available)")
if dry_run:
log(f"\n[DRY RUN] Would download and install model: {model_info['model_name']}")
return (True, False, model_info['model_name'])
# Create temporary directory for download
with tempfile.TemporaryDirectory() as tmp_dir:
gguf_path = os.path.join(tmp_dir, model_info['gguf_filename'])
try:
# Download the GGUF file
download_file(model_info['resolve_url'], gguf_path, model_info['gguf_filename'], should_cancel, progress_callback)
# Verify checksum if provided
if model_info['sha256']:
if not verify_checksum(gguf_path, model_info['sha256']):
print(f"✗ Checksum verification failed!")
return (False, False, model_info['model_name'])
# Create the Ollama model
create_ollama_model(
modelfile_path,
gguf_path,
model_info['model_name'],
model_info.get('capabilities')
)
print(f"\n✓ Successfully installed model: {model_info['model_name']}")
return (True, False, model_info['model_name'])
except Exception as e:
print(f"\n✗ Failed to install model: {e}")
return (False, False, model_info['model_name'])
def install_model_wrapper(args):
"""Wrapper for parallel execution."""
return install_model(*args)
def main():
parser = argparse.ArgumentParser(
description='Install Ollama models from HuggingFace using Modelfiles',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Install a single model
%(prog)s path/to/model.Modelfile
# Install all models in the default repo directory
%(prog)s
# Dry run to see what would be installed
%(prog)s --dry-run
# Skip models that already exist
%(prog)s --skip-existing
# Install with 3 parallel downloads
%(prog)s --parallel 3
"""
)
parser.add_argument(
'modelfile',
nargs='?',
help='Path to a specific .Modelfile to install (optional)'
)
parser.add_argument(
'--dir',
default='./modelfile-repo',
help='Directory containing .Modelfile files (default: ./modelfile-repo)'
)
parser.add_argument(
'--dry-run',
action='store_true',
help='Simulate installation without downloading or creating models'
)
parser.add_argument(
'--skip-existing',
action='store_true',
help='Skip models that already exist in Ollama'
)
parser.add_argument(
'--parallel',
type=int,
default=1,
metavar='N',
help='Number of parallel downloads/installations (default: 1)'
)
args = parser.parse_args()
# Validate parallel argument
if args.parallel < 1:
print("✗ Error: --parallel must be at least 1")
sys.exit(1)
# Check if ollama is available
try:
subprocess.run(['ollama', '--version'], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
print("✗ Error: 'ollama' command not found. Please install Ollama first.")
print(" Visit: https://ollama.ai")
sys.exit(1)
# Get existing models if skip_existing is enabled
existing_models = None
if args.skip_existing:
existing_models = get_existing_models()
if existing_models:
print(f"Found {len(existing_models)} existing model(s)")
# Determine which Modelfiles to process
if args.modelfile:
# Single file mode
modelfile_path = Path(args.modelfile)
if not modelfile_path.exists():
print(f"✗ Error: File not found: {modelfile_path}")
sys.exit(1)
if not modelfile_path.suffix == '.Modelfile':
print(f"✗ Error: File must have .Modelfile extension")
sys.exit(1)
modelfiles = [modelfile_path]
else:
# Batch mode - process all .Modelfile files in directory
modelfile_dir = Path(args.dir)
if not modelfile_dir.exists():
print(f"✗ Error: Directory not found: {modelfile_dir}")
sys.exit(1)
modelfiles = sorted(modelfile_dir.glob('*.Modelfile'))
if not modelfiles:
print(f"✗ No .Modelfile files found in {modelfile_dir}")
sys.exit(1)
print(f"Found {len(modelfiles)} Modelfile(s) to process")
if args.dry_run:
print("\n*** DRY RUN MODE - No files will be downloaded or models created ***\n")
# Process all Modelfiles
results = []
if args.parallel > 1 and len(modelfiles) > 1:
# Parallel processing
print(f"\nUsing {args.parallel} parallel worker(s)")
with ThreadPoolExecutor(max_workers=args.parallel) as executor:
# Submit all tasks
future_to_modelfile = {
executor.submit(
install_model_wrapper,
(modelfile, args.dry_run, args.skip_existing, existing_models)
): modelfile
for modelfile in modelfiles
}
# Collect results as they complete
for future in as_completed(future_to_modelfile):
modelfile = future_to_modelfile[future]
try:
success, skipped, model_name = future.result()
results.append((modelfile.name, success, skipped))
except Exception as e:
print(f"\n✗ Exception processing {modelfile.name}: {e}")
results.append((modelfile.name, False, False))
else:
# Sequential processing
for modelfile in modelfiles:
success, skipped, model_name = install_model(
modelfile,
args.dry_run,
args.skip_existing,
existing_models
)
results.append((modelfile.name, success, skipped))
# Summary
print(f"\n{'='*80}")
print("INSTALLATION SUMMARY")
print(f"{'='*80}")
successful = sum(1 for _, success, skipped in results if success and not skipped)
skipped = sum(1 for _, success, skip in results if skip)
failed = len(results) - successful - skipped
for name, success, skip in results:
if skip:
status = "⊘"
elif success:
status = "✓"
else:
status = "✗"
print(f"{status} {name}")
print(f"\nTotal: {len(results)} | Successful: {successful} | Skipped: {skipped} | Failed: {failed}")
if failed > 0:
sys.exit(1)
if __name__ == '__main__':
main()

184
scripts/model-info.py Normal file

@@ -0,0 +1,184 @@
#!/usr/bin/env python3
"""
Ollama Model Inventory
- Parses the official 'Capabilities' section from ollama show
- Accurate VRAM estimation
"""
import subprocess
import re
from typing import Dict, List
def get_cmd_output(cmd: List[str]) -> str:
try:
# Run command and get stdout
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
return result.stdout.strip()
except subprocess.CalledProcessError:
return ""
def parse_parameters(param_str: str) -> float:
"""Parses '8.0B' or '307M' into standard Billions (float)"""
if not param_str or param_str == "N/A": return 0.0
clean_val = re.sub(r"[^0-9.]", "", param_str)
try:
val = float(clean_val)
if "M" in param_str.upper(): return val / 1000.0
return val
except ValueError: return 0.0
def estimate_vram(params_billions: float, quant: str, context: int, context_used: int) -> str:
"""Estimates VRAM usage (Model Weights + Typical KV Cache)."""
if params_billions == 0.0: return "N/A"
# 1. Weights Size (bits per parameter)
q_up = quant.upper()
if "MXFP4" in q_up or "FP4" in q_up: bpp = 0.55
elif "Q8" in q_up: bpp = 1.0
elif "Q6" in q_up: bpp = 0.85
elif "Q5" in q_up: bpp = 0.75
elif "Q4" in q_up: bpp = 0.65
elif "Q3" in q_up: bpp = 0.55
elif "Q2" in q_up: bpp = 0.45
elif "IQ" in q_up: bpp = 0.35 # IQ quantization
elif "F16" in q_up or "BF16" in q_up: bpp = 2.0
elif "F32" in q_up: bpp = 4.0
else: bpp = 0.65 # Default Q4_K_M
weight_gb = params_billions * bpp
# 2. KV Cache Size
# More accurate formula: context_tokens * embedding_dim * layers * 2 (K+V) * bytes_per_value / 1e9
# Simplified: For a typical LLM, ~0.002 GB per 1000 tokens at FP16
# Use actual context_used if available, otherwise use a reasonable default (8K)
effective_context = context_used if context_used > 0 else min(context, 8192)
kv_cache_gb = (effective_context / 1000) * 0.002
# 3. System Overhead (Ollama runtime, etc.)
overhead_gb = 0.3
total_gb = weight_gb + kv_cache_gb + overhead_gb
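# Worked example: an 8.0B model at Q4_K_M with 8K context -> 8*0.65 + ~0.02 + 0.3 ≈ 5.5 GB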
if total_gb < 1: return f"{total_gb * 1024:.0f} MB"
return f"{total_gb:.1f} GB"
def get_model_info(name: str, disk_size: str) -> Dict:
try:
raw_show = get_cmd_output(['ollama', 'show', name])
except Exception as e:
return {
'model': name,
'disk': disk_size,
'family': 'ERROR',
'params_str': 'N/A',
'quant': 'N/A',
'context': 0,
'context_used': 0,
'caps': [],
'vram': 'N/A'
}
info = {
'model': name,
'disk': disk_size,
'family': 'N/A',
'params_str': 'N/A',
'quant': 'N/A',
'context': 0,
'context_used': 0, # Actual context from Parameters section
'caps': []
}
# -- State Machine Parsing --
current_section = None
lines = raw_show.split('\n')
for line in lines:
line = line.strip()
if not line: continue
# Detect Sections
if line in ["Model", "Capabilities", "Parameters", "System", "License"]:
current_section = line
continue
# Parse 'Model' Section
if current_section == "Model":
parts = line.split(maxsplit=1)
if len(parts) == 2:
k, v = parts[0].lower(), parts[1].strip()
if 'architecture' in k: info['family'] = v
elif 'parameters' in k: info['params_str'] = v
elif 'quantization' in k: info['quant'] = v
elif 'context' in k and 'length' in k:
if v.isdigit(): info['context'] = int(v)
# Fallback regex for context
if 'context' in line.lower() and info['context'] == 0:
match = re.search(r'context\s+length\s+(\d+)', line, re.IGNORECASE)
if match: info['context'] = int(match.group(1))
# Parse 'Parameters' Section (runtime config)
elif current_section == "Parameters":
if 'num_ctx' in line.lower():
parts = line.split(maxsplit=1)
if len(parts) == 2 and parts[1].strip().isdigit():
info['context_used'] = int(parts[1].strip())
# Parse 'Capabilities' Section
elif current_section == "Capabilities":
cap = line.lower()
if cap in ['tools', 'vision', 'thinking', 'insert']:
info['caps'].append(cap.capitalize())
# -- VRAM Calc --
p_val = parse_parameters(info['params_str'])
info['vram'] = estimate_vram(p_val, info['quant'], info['context'], info['context_used'])
return info
def main():
print("Fetching Ollama inventory...")
list_out = get_cmd_output(['ollama', 'list'])
data = []
lines = list_out.split('\n')[1:]
for line in lines:
if not line.strip(): continue
parts = line.split()
if len(parts) >= 3:
name = parts[0]
disk = parts[2]
print(f" Analyzing {name}...", end='\r')
data.append(get_model_info(name, disk))
print(" " * 60, end='\r')
# Formatting Table
w = {'m': 38, 'a': 12, 'p': 8, 'q': 10, 'ctx': 12, 'cp': 18, 'd': 8, 'v': 8}
header = (f"{'MODEL':<{w['m']}} {'ARCH':<{w['a']}} {'PARAMS':<{w['p']}} "
f"{'QUANT':<{w['q']}} {'CONTEXT':<{w['ctx']}} {'CAPS':<{w['cp']}} "
f"{'DISK':>{w['d']}} {'VRAM':>{w['v']}}")
print(header)
print("-" * len(header))
for r in data:
caps_str = ", ".join(r['caps']) if r['caps'] else "-"
# Truncate overly long names
d_name = (r['model'][:w['m']-2] + '..') if len(r['model']) > w['m'] else r['model']
# Format context: show used/max or just max if used not set
if r['context_used'] > 0:
ctx_str = f"{r['context_used']}/{r['context']}"
else:
ctx_str = str(r['context'])
print(f"{d_name:<{w['m']}} {r['family']:<{w['a']}} {r['params_str']:<{w['p']}} "
f"{r['quant']:<{w['q']}} {ctx_str:<{w['ctx']}} {caps_str:<{w['cp']}} "
f"{r['disk']:>{w['d']}} {r['vram']:>{w['v']}}")
if __name__ == "__main__":
main()

164
scripts/ollama-monitor.py Normal file

@@ -0,0 +1,164 @@
#!/usr/bin/env python3
"""
Ollama Monitor - Real-time dashboard for Ollama instances
"""
import urllib.request
import json
import subprocess
import time
import os
import sys
# Terminal colors
CLEAR, BOLD, RESET = "\033[2J\033[H", "\033[1m", "\033[0m"
CYAN, GREEN, YELLOW, MAGENTA, RED = "\033[36m", "\033[32m", "\033[33m", "\033[35m", "\033[31m"
def discover_ollama_instances():
"""Auto-discover running Ollama instances."""
instances = {}
# Try default port
if check_ollama_available("http://localhost:11434"):
instances["Ollama (default)"] = "http://localhost:11434"
# Try common alternative ports
for port in [11435, 11436]:
url = f"http://localhost:{port}"
if check_ollama_available(url):
instances[f"Ollama (port {port})"] = url
return instances
def check_ollama_available(url):
"""Check if an Ollama instance is available at the given URL."""
try:
with urllib.request.urlopen(f"{url}/api/tags", timeout=1) as r:
return r.status == 200
except:
return False
def get_ollama_ps(url):
"""Get running models from Ollama instance."""
try:
with urllib.request.urlopen(f"{url}/api/ps", timeout=0.5) as r:
return json.loads(r.read().decode()).get('models', [])
except Exception:
return None
def get_gpu_metrics():
"""Try to get GPU metrics from AMD GPU sysfs."""
try:
# Try multiple possible GPU device paths
device_paths = [
"/sys/class/drm/card1/device/",
"/sys/class/drm/card0/device/",
]
for base_path in device_paths:
if not os.path.exists(base_path):
continue
try:
with open(base_path + "mem_info_vram_used", "r") as f:
used = int(f.read().strip()) / 1024 / 1024
with open(base_path + "mem_info_vram_total", "r") as f:
total = int(f.read().strip()) / 1024 / 1024
with open(base_path + "gpu_busy_percent", "r") as f:
load = int(f.read().strip())
# Sanity check: If VRAM usage is low but load is 99%, it's a driver glitch
if load == 99 and used < (total * 0.1):
load = 0
return used, total, load
except:
continue
return None, None, None
except:
return None, None, None
def get_sys_metrics():
"""Get system CPU and RAM metrics."""
try:
load_avg = os.getloadavg()[0]
mem_output = subprocess.check_output("free -m", shell=True).decode().split('\n')[1].split()
ram_used = int(mem_output[2])
ram_total = int(mem_output[1])
return load_avg, ram_used, ram_total
except Exception:
return 0.0, 0, 0
def draw(instances):
"""Draw the monitoring dashboard."""
load_avg, ram_used, ram_total = get_sys_metrics()
vram_used, vram_total, gpu_load = get_gpu_metrics()
out = [f"{CLEAR}{BOLD}{CYAN}=== OLLAMA MONITOR ==={RESET}"]
# System metrics
out.append(f"{BOLD}CPU Load:{RESET} {YELLOW}{load_avg:.2f}{RESET} | "
f"{BOLD}RAM:{RESET} {MAGENTA}{ram_used}MB/{ram_total}MB{RESET}", )
# GPU metrics (if available)
if vram_total is not None and gpu_load is not None:
load_color = GREEN if gpu_load < 80 else RED
out.append(f"{BOLD}GPU Load:{RESET} {load_color}{gpu_load}%{RESET} | "
f"{BOLD}VRAM:{RESET} {CYAN}{vram_used:.0f}MB/{vram_total:.0f}MB{RESET}")
out.append("─" * 70)
# Ollama instances
for name, url in instances.items():
models = get_ollama_ps(url)
status = f"{GREEN}ONLINE{RESET}" if models is not None else f"{RED}OFFLINE{RESET}"
out.append(f"\n{BOLD}{name}{RESET} [{status}] - {url}")
if models:
if len(models) > 0:
out.append(f" {'MODEL':<40} {'SIZE':<12} {'UNTIL':<20}")
for m in models:
size_gb = m.get('size', 0) / (1024**3)
until = m.get('expires_at', 'N/A')
if until != 'N/A' and 'T' in until:
# Parse ISO timestamp and show relative time
until = until.split('T')[1].split('.')[0]
out.append(f" {m['name'][:39]:<40} {size_gb:>6.1f} GB {until}")
else:
out.append(f" {YELLOW}IDLE{RESET}")
elif models is None:
out.append(f" {RED}Connection failed{RESET}")
print("\n".join(out) + f"\n\n{BOLD}{CYAN}Refreshing... (Ctrl+C to quit){RESET}")
def main():
print("Discovering Ollama instances...")
instances = discover_ollama_instances()
if not instances:
print(f"{RED}✗ No Ollama instances found.{RESET}")
print(" Make sure Ollama is running on the default port (11434)")
sys.exit(1)
print(f"Found {len(instances)} instance(s). Starting monitor...\n")
time.sleep(1)
try:
while True:
draw(instances)
time.sleep(1)
except KeyboardInterrupt:
print("\nMonitor stopped.")
if __name__ == "__main__":
main()

274
scripts/vram-test.py Executable file

@@ -0,0 +1,274 @@
#!/usr/bin/env python3
"""
Ollama VRAM Test - Evaluate if models fit in VRAM
Tests models with their configured parameters and reports VRAM usage and CPU offloading.
"""
import argparse
import json
import subprocess
import sys
import time
import urllib.request
from typing import Dict, List, Optional
def get_ollama_url():
"""Get the Ollama API URL."""
return "http://localhost:11434"
def get_installed_models() -> List[str]:
"""Get list of installed Ollama models."""
try:
result = subprocess.run(
['ollama', 'list'],
capture_output=True,
text=True,
check=True
)
models = []
for line in result.stdout.strip().split('\n')[1:]: # Skip header
if line.strip():
name = line.split()[0]
models.append(name)
return models
except subprocess.CalledProcessError:
return []
def get_model_info(model_name: str) -> Dict:
"""Get model information from ollama show."""
try:
result = subprocess.run(
['ollama', 'show', model_name],
capture_output=True,
text=True,
check=True
)
info = {
'size': 'N/A',
'quant': 'N/A',
'num_ctx': 'N/A',
'params': 'N/A'
}
current_section = None
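# 'ollama show' prints labelled sections (e.g. "Model", "Parameters");
# track the current section so key/value lines are attributed correctly.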
for line in result.stdout.split('\n'):
line = line.strip()
if not line:
continue
if line in ["Model", "Parameters"]:
current_section = line
continue
if current_section == "Model":
parts = line.split(maxsplit=1)
if len(parts) == 2:
k, v = parts[0].lower(), parts[1].strip()
if 'quantization' in k:
info['quant'] = v
elif 'parameters' in k:
info['params'] = v
elif current_section == "Parameters":
if 'num_ctx' in line.lower():
parts = line.split(maxsplit=1)
if len(parts) == 2:
info['num_ctx'] = parts[1].strip()
return info
except subprocess.CalledProcessError:
return {'size': 'N/A', 'quant': 'N/A', 'num_ctx': 'N/A', 'params': 'N/A'}
def test_model_vram(model_name: str) -> Dict:
"""
Test a model's VRAM usage by loading it with a minimal prompt.
Returns dict with model stats and VRAM usage.
"""
print(f"Testing {model_name}...", end=' ', flush=True)
# Get model info first
info = get_model_info(model_name)
# Send a minimal test prompt to force model loading
url = f"{get_ollama_url()}/api/generate"
prompt_data = {
"model": model_name,
"prompt": "Reply with only: OK",
"stream": False
}
try:
req = urllib.request.Request(
url,
data=json.dumps(prompt_data).encode('utf-8'),
headers={'Content-Type': 'application/json'}
)
# Send request and wait for model to load
with urllib.request.urlopen(req, timeout=30) as response:
response.read() # Wait for completion
# Give it a moment to stabilize
time.sleep(0.5)
# Now check /api/ps for VRAM usage
ps_url = f"{get_ollama_url()}/api/ps"
with urllib.request.urlopen(ps_url, timeout=5) as r:
ps_data = json.loads(r.read().decode())
models = ps_data.get('models', [])
# Find our model in the running models
for m in models:
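# Match the exact name or the name with an appended tag (e.g. "model:latest")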
if m['name'] == model_name or m['name'].startswith(model_name + ':'):
size_bytes = m.get('size', 0)
size_vram = m.get('size_vram', 0)
# Calculate VRAM usage in GB
vram_gb = size_vram / (1024**3) if size_vram > 0 else 0
total_gb = size_bytes / (1024**3) if size_bytes > 0 else 0
# Calculate offload percentage (how much is on CPU)
if size_bytes > 0:
offload_pct = ((size_bytes - size_vram) / size_bytes) * 100
else:
offload_pct = 0
print("")
return {
'model': model_name,
'params': info['params'],
'size_gb': total_gb,
'quant': info['quant'],
'num_ctx': info['num_ctx'],
'vram_gb': vram_gb,
'offload_pct': offload_pct,
'success': True
}
# Model not found in ps output
print("✗ (not in ps)")
return {
'model': model_name,
'params': info['params'],
'size_gb': 0,
'quant': info['quant'],
'num_ctx': info['num_ctx'],
'vram_gb': 0,
'offload_pct': 0,
'success': False
}
except Exception as e:
print(f"✗ ({str(e)[:30]})")
return {
'model': model_name,
'params': info['params'],
'size_gb': 0,
'quant': info['quant'],
'num_ctx': info['num_ctx'],
'vram_gb': 0,
'offload_pct': 0,
'success': False
}
def main():
parser = argparse.ArgumentParser(
description='Test Ollama models for VRAM usage and CPU offloading',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Test all installed models
%(prog)s
# Test a specific model
%(prog)s ministral-3:3b-instruct-2512-q5_k_m
"""
)
parser.add_argument(
'model',
nargs='?',
help='Specific model to test (optional, tests all if omitted)'
)
args = parser.parse_args()
# Check if ollama is available
try:
subprocess.run(['ollama', '--version'], capture_output=True, check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
print("✗ Error: 'ollama' command not found. Please install Ollama first.")
sys.exit(1)
# Determine which models to test
if args.model:
models = [args.model]
else:
models = get_installed_models()
if not models:
print("✗ No models found")
sys.exit(1)
print(f"Found {len(models)} installed model(s)\n")
# Test each model
results = []
for model in models:
result = test_model_vram(model)
results.append(result)
# Display results table
print("\n" + "="*110)
print("VRAM USAGE TEST RESULTS")
print("="*110)
# Column widths
w = {'m': 38, 'p': 8, 's': 10, 'q': 10, 'ctx': 10, 'v': 10, 'o': 12}
header = (f"{'MODEL':<{w['m']}} {'PARAMS':<{w['p']}} {'SIZE':<{w['s']}} "
f"{'QUANT':<{w['q']}} {'NUM_CTX':<{w['ctx']}} {'VRAM':>{w['v']}} {'OFFLOAD':>{w['o']}}")
print(header)
print("-" * 110)
for r in results:
# Truncate long model names
name = (r['model'][:w['m']-2] + '..') if len(r['model']) > w['m'] else r['model']
# Format values
size_str = f"{r['size_gb']:.1f} GB" if r['size_gb'] > 0 else "N/A"
vram_str = f"{r['vram_gb']:.1f} GB" if r['vram_gb'] > 0 else "N/A"
# Offload status
if r['success']:
if r['offload_pct'] > 0:
offload_str = f"{r['offload_pct']:.1f}% CPU"
else:
offload_str = "0% (GPU only)"
else:
offload_str = "Failed"
print(f"{name:<{w['m']}} {r['params']:<{w['p']}} {size_str:<{w['s']}} "
f"{r['quant']:<{w['q']}} {r['num_ctx']:<{w['ctx']}} {vram_str:>{w['v']}} {offload_str:>{w['o']}}")
# Summary
successful = sum(1 for r in results if r['success'])
with_offload = sum(1 for r in results if r['success'] and r['offload_pct'] > 0)
print("\n" + "="*110)
print(f"Tested: {len(results)} | Successful: {successful} | CPU Offloading: {with_offload}")
if with_offload > 0:
print(f"\n{with_offload} model(s) using CPU offloading - consider reducing num_ctx or using smaller quantization")
if __name__ == '__main__':
main()

1078 static/script.js Normal file

File diff suppressed because it is too large

750 static/style.css Normal file
View File

@@ -0,0 +1,750 @@
/* Ollama Utils Web Interface - Clean & Simple CSS */
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
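/* Dark-theme palette; components below reference these custom properties */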
:root {
--bg-primary: #1e1e1e;
--bg-secondary: #2d2d2d;
--bg-tertiary: #3d3d3d;
--text-primary: #e0e0e0;
--text-secondary: #b0b0b0;
--accent-primary: #4a9eff;
--accent-hover: #6eb3ff;
--success: #4caf50;
--warning: #ff9800;
--danger: #f44336;
--border: #404040;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background-color: var(--bg-primary);
color: var(--text-primary);
line-height: 1.6;
}
.container {
display: flex;
height: 100vh;
overflow: hidden;
}
/* Sidebar */
.sidebar {
width: 300px;
background-color: var(--bg-secondary);
padding: 20px;
overflow-y: auto;
border-right: 1px solid var(--border);
}
.sidebar h1 {
font-size: 24px;
margin-bottom: 20px;
color: var(--accent-primary);
}
.sidebar h2 {
font-size: 16px;
margin: 20px 0 10px 0;
color: var(--text-secondary);
text-transform: uppercase;
letter-spacing: 0.5px;
}
.stats-section {
margin-bottom: 30px;
}
.stat-group {
margin-bottom: 15px;
}
.stat-label {
font-size: 12px;
color: var(--text-secondary);
margin-bottom: 4px;
}
.stat-value {
font-size: 18px;
font-weight: 600;
color: var(--text-primary);
}
.progress-bar {
width: 100%;
height: 6px;
background-color: var(--bg-tertiary);
border-radius: 3px;
margin-top: 5px;
overflow: hidden;
}
.progress-fill {
height: 100%;
background-color: var(--accent-primary);
transition: width 0.3s ease;
}
.running-models-section {
margin-top: 20px;
}
.running-model {
background-color: var(--bg-tertiary);
padding: 10px;
margin-bottom: 8px;
border-radius: 4px;
font-size: 13px;
}
.running-model-name {
font-weight: 600;
margin-bottom: 4px;
word-break: break-all;
}
.running-model-stats {
font-size: 11px;
color: var(--text-secondary);
}
.no-models {
color: var(--text-secondary);
font-size: 13px;
font-style: italic;
}
/* Main Content */
.main-content {
flex: 1;
overflow-y: auto;
padding: 20px;
}
/* Tabs */
.tabs {
display: flex;
gap: 10px;
margin-bottom: 20px;
border-bottom: 1px solid var(--border);
}
.tab-button {
padding: 12px 24px;
background: none;
border: none;
color: var(--text-secondary);
cursor: pointer;
font-size: 14px;
font-weight: 500;
border-bottom: 2px solid transparent;
transition: all 0.2s;
}
.tab-button:hover {
color: var(--text-primary);
}
.tab-button.active {
color: var(--accent-primary);
border-bottom-color: var(--accent-primary);
}
.tab-content {
display: none;
}
.tab-content.active {
display: block;
}
.tab-header {
display: flex;
justify-content: space-between;
align-items: center;
margin-bottom: 20px;
}
.tab-header h2 {
font-size: 22px;
}
.tab-actions {
display: flex;
gap: 10px;
}
/* Buttons */
.btn {
padding: 10px 20px;
border: none;
border-radius: 4px;
cursor: pointer;
font-size: 14px;
font-weight: 500;
transition: all 0.2s;
}
.btn-primary {
background-color: var(--accent-primary);
color: white;
}
.btn-primary:hover {
background-color: var(--accent-hover);
}
.btn-secondary {
background-color: var(--bg-tertiary);
color: var(--text-primary);
}
.btn-secondary:hover {
background-color: var(--bg-secondary);
}
.btn-danger {
background-color: var(--danger);
color: white;
}
.btn-danger:hover {
background-color: #d32f2f;
}
.btn-small {
padding: 6px 12px;
font-size: 12px;
}
.button-group {
display: flex;
gap: 10px;
align-items: center;
}
/* Installation Progress */
.install-progress {
background-color: var(--bg-primary);
border: 1px solid var(--border);
border-radius: 4px;
padding: 15px;
margin: 10px 0;
}
.install-progress-header {
display: flex;
align-items: center;
gap: 8px;
margin-bottom: 10px;
font-weight: 500;
color: var(--accent-primary);
}
.install-status-icon {
font-size: 18px;
}
.install-status-text {
font-size: 14px;
}
.install-progress-bar-container {
display: flex;
align-items: center;
gap: 12px;
margin-bottom: 10px;
}
.install-progress-bar {
flex: 1;
height: 24px;
background-color: var(--bg-secondary);
border: 1px solid var(--border);
border-radius: 4px;
overflow: hidden;
}
.install-progress-bar-fill {
height: 100%;
background: linear-gradient(90deg, var(--accent-primary), var(--accent-hover));
transition: width 0.3s ease;
display: flex;
align-items: center;
justify-content: flex-end;
}
.install-progress-percent {
font-size: 13px;
font-weight: 500;
color: var(--text-primary);
min-width: 48px;
text-align: right;
}
.install-progress-actions {
display: flex;
justify-content: flex-end;
}
/* Models List */
.models-list {
display: grid;
gap: 15px;
}
.model-card {
background-color: var(--bg-secondary);
border: 1px solid var(--border);
border-radius: 6px;
padding: 15px;
transition: border-color 0.2s;
}
.model-card:hover {
border-color: var(--accent-primary);
}
.model-header {
display: flex;
justify-content: space-between;
align-items: flex-start;
margin-bottom: 10px;
}
.model-name {
font-size: 16px;
font-weight: 600;
color: var(--text-primary);
word-break: break-all;
}
.model-actions {
display: flex;
gap: 5px;
}
.model-info {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
gap: 10px;
margin-top: 10px;
}
.model-info-item {
font-size: 13px;
}
.model-info-label {
color: var(--text-secondary);
font-size: 11px;
margin-bottom: 2px;
}
.model-info-value {
color: var(--text-primary);
}
.model-badges {
display: flex;
gap: 5px;
margin-top: 10px;
flex-wrap: wrap;
}
.badge {
padding: 4px 8px;
border-radius: 3px;
font-size: 11px;
font-weight: 500;
}
.badge-modelfile {
background-color: var(--success);
color: white;
}
.badge-ollama {
background-color: var(--accent-primary);
color: white;
}
.badge-capability {
background-color: var(--bg-tertiary);
color: var(--text-secondary);
}
.badge-warning {
background-color: var(--warning);
color: white;
margin-left: 8px;
}
/* Model Card States */
.model-not-installed {
opacity: 0.8;
border-color: var(--warning);
}
.model-not-installed:hover {
opacity: 1;
}
/* Results Table */
.results-table {
width: 100%;
border-collapse: collapse;
margin-top: 15px;
}
.results-table th,
.results-table td {
padding: 10px;
text-align: left;
border-bottom: 1px solid var(--border);
}
.results-table th {
background-color: var(--bg-tertiary);
font-weight: 600;
font-size: 13px;
text-transform: uppercase;
letter-spacing: 0.5px;
}
.results-table td {
font-size: 14px;
}
.results-table tbody tr:hover {
background-color: rgba(74, 158, 255, 0.05);
}
.results-table code {
background-color: var(--bg-tertiary);
padding: 2px 6px;
border-radius: 3px;
font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', 'Consolas', monospace;
font-size: 12px;
}
/* Loading */
.loading {
text-align: center;
padding: 40px;
color: var(--text-secondary);
}
/* Performance Tools */
.performance-tools {
display: grid;
gap: 30px;
}
.tool-section {
background-color: var(--bg-secondary);
border: 1px solid var(--border);
border-radius: 6px;
padding: 20px;
}
.tool-section h3 {
font-size: 18px;
margin-bottom: 10px;
}
.tool-section p {
color: var(--text-secondary);
font-size: 14px;
margin-bottom: 15px;
}
.form-group {
margin-bottom: 15px;
}
.form-group label {
display: block;
margin-bottom: 5px;
font-size: 13px;
font-weight: 500;
}
.form-control {
width: 100%;
padding: 10px;
background-color: var(--bg-primary);
border: 1px solid var(--border);
border-radius: 4px;
color: var(--text-primary);
font-size: 14px;
font-family: inherit;
}
.form-control:focus {
outline: none;
border-color: var(--accent-primary);
}
.form-group small {
display: block;
margin-top: 4px;
font-size: 12px;
color: var(--text-secondary);
}
.modelfile-editor {
font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', 'Consolas', monospace;
font-size: 13px;
line-height: 1.5;
resize: vertical;
}
.results-box, .output-box {
margin-top: 15px;
padding: 15px;
background-color: var(--bg-primary);
border: 1px solid var(--border);
border-radius: 4px;
font-size: 13px;
max-height: 300px;
overflow-y: auto;
}
.results-box:empty, .output-box:empty {
display: none;
}
.result-item {
padding: 8px 0;
border-bottom: 1px solid var(--border);
}
.result-item:last-child {
border-bottom: none;
}
/* Modals */
.modal {
display: none;
position: fixed;
z-index: 1000;
left: 0;
top: 0;
width: 100%;
height: 100%;
background-color: rgba(0, 0, 0, 0.7);
}
.modal.active {
display: flex;
justify-content: center;
align-items: center;
}
.modal-content {
background-color: var(--bg-secondary);
border-radius: 8px;
width: 90%;
max-width: 800px;
max-height: 90vh;
display: flex;
flex-direction: column;
}
.modal-header {
display: flex;
justify-content: space-between;
align-items: center;
padding: 20px;
border-bottom: 1px solid var(--border);
}
.modal-header h2 {
font-size: 20px;
margin: 0;
}
.modal-close {
background: none;
border: none;
color: var(--text-secondary);
font-size: 28px;
cursor: pointer;
padding: 0;
width: 30px;
height: 30px;
display: flex;
align-items: center;
justify-content: center;
}
.modal-close:hover {
color: var(--text-primary);
}
.modal-body {
padding: 20px;
overflow-y: auto;
}
.modal-actions {
display: flex;
gap: 10px;
margin-top: 15px;
}
/* Install Options */
.install-options {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 15px;
margin-bottom: 20px;
}
.install-option {
background-color: var(--bg-primary);
border: 2px solid var(--border);
border-radius: 6px;
padding: 20px;
cursor: pointer;
transition: all 0.2s;
text-align: left;
}
.install-option:hover {
border-color: var(--accent-primary);
}
.install-option.active {
border-color: var(--accent-primary);
background-color: rgba(74, 158, 255, 0.1);
}
.install-option h3 {
font-size: 16px;
margin-bottom: 5px;
}
.install-option p {
font-size: 13px;
color: var(--text-secondary);
margin: 0;
}
.install-form {
display: none;
}
.install-form.active {
display: block;
}
.info-text {
background-color: rgba(74, 158, 255, 0.1);
border-left: 3px solid var(--accent-primary);
padding: 10px;
margin-bottom: 15px;
font-size: 13px;
}
.info-text a {
color: var(--accent-primary);
text-decoration: none;
}
.info-text a:hover {
text-decoration: underline;
}
.warning-text {
background-color: rgba(255, 152, 0, 0.1);
border-left: 3px solid var(--warning);
padding: 10px;
margin-bottom: 15px;
font-size: 13px;
}
.warning-text a {
color: var(--warning);
text-decoration: none;
font-weight: 500;
}
.warning-text a:hover {
text-decoration: underline;
}
.modelfile-section {
margin-top: 20px;
}
.modelfile-section h3 {
font-size: 16px;
margin-bottom: 10px;
}
/* Success/Error Messages */
.success-message {
color: var(--success);
padding: 10px;
background-color: rgba(76, 175, 80, 0.1);
border-left: 3px solid var(--success);
border-radius: 4px;
margin-top: 10px;
}
.error-message {
color: var(--danger);
padding: 10px;
background-color: rgba(244, 67, 54, 0.1);
border-left: 3px solid var(--danger);
border-radius: 4px;
margin-top: 10px;
}
/* Scrollbar */
::-webkit-scrollbar {
width: 10px;
}
::-webkit-scrollbar-track {
background: var(--bg-primary);
}
::-webkit-scrollbar-thumb {
background: var(--bg-tertiary);
border-radius: 5px;
}
::-webkit-scrollbar-thumb:hover {
background: var(--border);
}
/* Responsive */
@media (max-width: 768px) {
.container {
flex-direction: column;
}
.sidebar {
width: 100%;
border-right: none;
border-bottom: 1px solid var(--border);
}
.install-options {
grid-template-columns: 1fr;
}
.model-info {
grid-template-columns: 1fr;
}
}

259 templates/index.html Normal file
View File

@@ -0,0 +1,259 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Ollama Utils - Web Interface</title>
<link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
</head>
<body>
<div class="container">
<!-- Sidebar -->
<aside class="sidebar">
<h1>Ollama Utils</h1>
<div class="stats-section">
<h2>System Status</h2>
<div class="stat-group">
<div class="stat-label">CPU Load</div>
<div class="stat-value" id="cpu-load">--</div>
</div>
<div class="stat-group">
<div class="stat-label">RAM Usage</div>
<div class="stat-value" id="ram-usage">--</div>
<div class="progress-bar">
<div class="progress-fill" id="ram-progress"></div>
</div>
</div>
<div class="stat-group">
<div class="stat-label">VRAM Usage</div>
<div class="stat-value" id="vram-usage">--</div>
<div class="progress-bar">
<div class="progress-fill" id="vram-progress"></div>
</div>
</div>
<div class="stat-group">
<div class="stat-label">GPU Load</div>
<div class="stat-value" id="gpu-load">--</div>
</div>
</div>
<div class="running-models-section">
<h2>Running Models</h2>
<div id="running-models-list">
<div class="no-models">No models loaded</div>
</div>
</div>
</aside>
<!-- Main Content -->
<main class="main-content">
<!-- Tabs -->
<div class="tabs">
<button class="tab-button active" data-tab="models">Model Management</button>
<button class="tab-button" data-tab="performance">Performance Assessment</button>
</div>
<!-- Model Management Tab -->
<div class="tab-content active" id="tab-models">
<div class="tab-header">
<h2>Available Models</h2>
<div class="tab-actions">
<button class="btn btn-primary" id="btn-install">Import Model</button>
<button class="btn btn-secondary" id="btn-refresh-models">Refresh</button>
</div>
</div>
<div class="models-list" id="models-list">
<div class="loading">Loading models...</div>
</div>
</div>
<!-- Performance Assessment Tab -->
<div class="tab-content" id="tab-performance">
<div class="tab-header">
<h2>Performance Assessment</h2>
</div>
<div class="performance-tools">
<div class="tool-section">
<h3>VRAM Test</h3>
<p>Test a model's VRAM usage and CPU offloading. This will load the model with a minimal prompt and report actual VRAM consumption.</p>
<div class="form-group">
<label for="vram-test-model">Select Model:</label>
<select id="vram-test-model" class="form-control">
<option value="">-- Select a model --</option>
</select>
</div>
<button class="btn btn-primary" id="btn-run-vram-test">Run VRAM Test</button>
<div id="vram-test-results" class="results-box"></div>
</div>
<div class="tool-section">
<h3>Context Optimizer</h3>
<p>Find the optimal context size (num_ctx) for a model based on available VRAM. This iteratively tests different context sizes.</p>
<div class="form-group">
<label for="optimizer-model">Select Model:</label>
<select id="optimizer-model" class="form-control">
<option value="">-- Select a model --</option>
</select>
</div>
<div class="form-group">
<label for="optimizer-overhead">GPU Overhead (GB):</label>
<input type="number" id="optimizer-overhead" class="form-control" value="0.0" step="0.1" min="0">
</div>
<div class="form-group">
<label for="optimizer-max-turns">Max Iterations (0 = unlimited):</label>
<input type="number" id="optimizer-max-turns" class="form-control" value="0" min="0">
</div>
<div class="button-group">
<button class="btn btn-primary" id="btn-run-optimizer">Run Optimizer</button>
<button class="btn btn-danger" id="btn-stop-optimizer" style="display:none;">Stop</button>
</div>
<div id="optimizer-results" class="results-box"></div>
</div>
</div>
</div>
</main>
</div>
<!-- Import Model Modal -->
<div id="install-modal" class="modal">
<div class="modal-content">
<div class="modal-header">
<h2>Import Model</h2>
<button class="modal-close" id="modal-close">&times;</button>
</div>
<div class="modal-body">
<div class="install-options">
<button class="install-option active" data-option="ollama">
<h3>Ollama Library</h3>
<p>Install from official Ollama model library</p>
</button>
<button class="install-option" data-option="huggingface">
<h3>HuggingFace GGUF</h3>
<p>Import GGUF model from HuggingFace</p>
</button>
</div>
<!-- Ollama Library Install -->
<div class="install-form active" id="install-ollama">
<p class="info-text">
Browse available models at <a href="https://ollama.com/library" target="_blank">ollama.com/library</a>
</p>
<div class="form-group">
<label for="ollama-model-name">Model Name:</label>
<input type="text" id="ollama-model-name" class="form-control" placeholder="e.g., llama2, mistral, phi">
<small>Enter the model name with optional tag (e.g., llama2:7b)</small>
</div>
<button class="btn btn-primary" id="btn-install-ollama">Pull Model</button>
<div id="ollama-install-output" class="output-box"></div>
</div>
<!-- HuggingFace Install -->
<div class="install-form" id="install-huggingface">
<p class="info-text">
Paste a HuggingFace repository URL or direct link to a GGUF file.
</p>
<div class="form-group">
<label for="hf-url">HuggingFace URL:</label>
<input type="text" id="hf-url" class="form-control" placeholder="https://huggingface.co/TheBloke/Llama-2-7B-GGUF">
<small>Repository URL or direct link to .gguf file</small>
</div>
<button class="btn btn-primary" id="btn-fetch-hf">Fetch Model Info</button>
<div id="hf-install-output" class="output-box"></div>
<!-- File selection section (for repo URLs) -->
<div id="hf-file-select-section" style="display: none; margin-top: 15px;"></div>
<div id="hf-modelfile-section" class="modelfile-section" style="display: none;">
<h3>Modelfile Configuration</h3>
<p class="warning-text">
⚠️ Please review and customize the Modelfile below. Refer to the
<a id="hf-repo-link" href="#" target="_blank">manufacturer's recommendations</a>
for optimal parameters.
</p>
<p class="info-text">
💡 <strong>Tip:</strong> The <code># capabilities:</code> comment documents expected model features
like tool calling or vision. Ollama detects these automatically from the GGUF file metadata.
This comment helps you track which models support which features.
</p>
<div class="form-group">
<label for="hf-model-name">Model Name:</label>
<input type="text" id="hf-model-name" class="form-control" placeholder="model:tag">
</div>
<div class="form-group">
<label for="hf-modelfile">Modelfile:</label>
<textarea id="hf-modelfile" class="form-control modelfile-editor" rows="20"></textarea>
</div>
<button class="btn btn-primary" id="btn-create-hf-model">Create Model</button>
</div>
<div id="hf-install-output" class="output-box"></div>
</div>
</div>
</div>
</div>
<!-- Model Details Modal -->
<div id="model-details-modal" class="modal">
<div class="modal-content">
<div class="modal-header">
<h2 id="model-details-title">Model Details</h2>
<button class="modal-close" id="model-details-close">&times;</button>
</div>
<div class="modal-body">
<div id="model-details-content"></div>
</div>
</div>
</div>
<!-- Modelfile Editor Modal -->
<div id="modelfile-editor-modal" class="modal">
<div class="modal-content">
<div class="modal-header">
<h2 id="modelfile-editor-title">Edit Modelfile</h2>
<button class="modal-close" id="modelfile-editor-close">&times;</button>
</div>
<div class="modal-body">
<div class="form-group">
<textarea id="modelfile-editor-content" class="form-control modelfile-editor" rows="25"></textarea>
</div>
<div class="modal-actions">
<button class="btn btn-primary" id="btn-save-modelfile">Save Changes</button>
<button class="btn btn-secondary" id="btn-cancel-modelfile">Cancel</button>
</div>
<div id="modelfile-save-output" class="output-box"></div>
</div>
</div>
</div>
<script src="{{ url_for('static', filename='script.js') }}"></script>
</body>
</html>

993 web_app.py Normal file
View File

@@ -0,0 +1,993 @@
#!/usr/bin/env python3
"""
Ollama-Utils Web Interface
A comprehensive web interface for managing Ollama models and monitoring system resources.
"""
import json
import os
import re
import subprocess
import sys
import urllib.request
import urllib.error
import threading
import time
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from flask import Flask, render_template, jsonify, request, send_from_directory
from urllib.parse import urlparse
# Import utilities from existing scripts
sys.path.append(os.path.join(os.path.dirname(__file__), 'scripts'))
# Import existing CLI tools
import importlib.util
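# The CLI scripts use dashes in their filenames, so they are loaded via
# importlib.util instead of a plain "import" statement.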
# Load vram-test module
vram_test_spec = importlib.util.spec_from_file_location("vram_test", os.path.join(os.path.dirname(__file__), 'scripts', 'vram-test.py'))
vram_test_module = importlib.util.module_from_spec(vram_test_spec)
vram_test_spec.loader.exec_module(vram_test_module)
# Load context-optimizer module
context_optimizer_spec = importlib.util.spec_from_file_location("context_optimizer", os.path.join(os.path.dirname(__file__), 'scripts', 'context-optimizer.py'))
context_optimizer_module = importlib.util.module_from_spec(context_optimizer_spec)
context_optimizer_spec.loader.exec_module(context_optimizer_module)
# Load model-info module
model_info_spec = importlib.util.spec_from_file_location("model_info", os.path.join(os.path.dirname(__file__), 'scripts', 'model-info.py'))
model_info_module = importlib.util.module_from_spec(model_info_spec)
model_info_spec.loader.exec_module(model_info_module)
# Load hf-llm-install module
hf_install_spec = importlib.util.spec_from_file_location("hf_install", os.path.join(os.path.dirname(__file__), 'scripts', 'hf-llm-install.py'))
hf_install_module = importlib.util.module_from_spec(hf_install_spec)
hf_install_spec.loader.exec_module(hf_install_module)
app = Flask(__name__)
app.config['MODELFILE_REPO'] = os.path.join(os.path.dirname(__file__), 'modelfile-repo')
# Global state for background installations
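# Each job record holds: status (queued/running/completed/failed/cancelled),
# progress (human-readable message), model_name, error, a cancelled flag and,
# for Modelfile-based installs, modelfile_path.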
install_jobs = {}
install_lock = threading.Lock()
# ===== UTILITY FUNCTIONS =====
def get_ollama_url():
"""Get the Ollama API URL."""
return "http://localhost:11434"
def get_gpu_metrics() -> Tuple[Optional[float], Optional[float], Optional[int]]:
"""Get GPU VRAM and load metrics."""
try:
device_paths = [
"/sys/class/drm/card1/device/",
"/sys/class/drm/card0/device/",
]
for base_path in device_paths:
if not os.path.exists(base_path):
continue
try:
with open(base_path + "mem_info_vram_used", "r") as f:
used = int(f.read().strip()) / 1024 / 1024
with open(base_path + "mem_info_vram_total", "r") as f:
total = int(f.read().strip()) / 1024 / 1024
with open(base_path + "gpu_busy_percent", "r") as f:
load = int(f.read().strip())
# Sanity check
if load == 99 and used < (total * 0.1):
load = 0
return used, total, load
except Exception:
continue
return None, None, None
except Exception:
return None, None, None
def get_sys_metrics() -> Tuple[float, int, int]:
"""Get system CPU and RAM metrics."""
try:
load_avg = os.getloadavg()[0]
mem_output = subprocess.check_output("free -m", shell=True).decode().split('\n')[1].split()
ram_used = int(mem_output[2])
ram_total = int(mem_output[1])
return load_avg, ram_used, ram_total
except Exception:
return 0.0, 0, 0
def get_model_info_detailed(model_name: str) -> Dict:
"""Get detailed model information from 'ollama show'. Uses model-info.py logic."""
# Get basic list info first
try:
result = subprocess.run(['ollama', 'list'], capture_output=True, text=True, check=True)
disk_size = 'N/A'
for line in result.stdout.strip().split('\n')[1:]:
if line.strip():
parts = line.split()
if len(parts) >= 3 and parts[0] == model_name:
disk_size = parts[2]
break
except Exception:
disk_size = 'N/A'
# Use the existing get_model_info function from model-info.py
info = model_info_module.get_model_info(model_name, disk_size)
# Convert to expected format (model-info uses slightly different keys)
return {
'name': model_name,
'family': info.get('family', 'N/A'),
'params': info.get('params_str', 'N/A'),
'quant': info.get('quant', 'N/A'),
'max_context': info.get('context', 0),
'context_used': info.get('context_used', 0),
'capabilities': [cap for cap in info.get('caps', [])],
'license': 'N/A',
'system_prompt': '',
'vram_estimate': info.get('vram', 'N/A')
}
def check_modelfile_exists(model_name: str) -> Optional[str]:
"""Check if a Modelfile exists for this model in the modelfile-repo directory."""
modelfile_dir = app.config['MODELFILE_REPO']
if not os.path.exists(modelfile_dir):
return None
# Try exact match first
modelfile_path = os.path.join(modelfile_dir, f"{model_name}.Modelfile")
if os.path.exists(modelfile_path):
return modelfile_path
# Try with colons replaced by dashes (ministral-3:3b -> ministral-3-3b)
normalized_name = model_name.replace(':', '-')
modelfile_path = os.path.join(modelfile_dir, f"{normalized_name}.Modelfile")
if os.path.exists(modelfile_path):
return modelfile_path
return None
def parse_modelfile_metadata(modelfile_path: str) -> Dict:
"""Parse metadata from a Modelfile using hf-llm-install.py logic."""
try:
# Use the existing parse_modelfile function from hf-llm-install.py
model_info = hf_install_module.parse_modelfile(modelfile_path)
if not model_info:
return None
# Extract quantization and other params from the modelfile content
quantization = None
num_ctx = None
family = None
params = None
with open(modelfile_path, 'r') as f:
content = f.read()
# Extract quantization
quant_match = re.search(r'#\s*quantization:\s*([a-zA-Z0-9_]+)', content)
if quant_match:
quantization = quant_match.group(1).upper()
else:
# Extract from filename if not specified
gguf_filename = model_info.get('gguf_filename', '')
quant_pattern = re.search(r'[_-](Q[0-9]+_[KLM]+(?:_[LSM])?)\.gguf', gguf_filename, re.IGNORECASE)
if quant_pattern:
quantization = quant_pattern.group(1).upper()
# Extract num_ctx
ctx_match = re.search(r'PARAMETER\s+num_ctx\s+(\d+)', content)
if ctx_match:
num_ctx = int(ctx_match.group(1))
# Extract params and family from model name
model_name = model_info['model_name']
# Pattern: modelbase:Xb-variant (e.g., "ministral-3:3b-instruct-2512-q5_k_m")
params_match = re.search(r':(\d+)b', model_name, re.IGNORECASE)
if params_match:
params = params_match.group(1) + 'B'
# Extract family from base name
if ':' in model_name:
family = model_name.split(':')[0].upper()
# Get capabilities from model_info (parsed by hf_install_module)
capabilities = model_info.get('capabilities', [])
# Convert to expected format
return {
'path': modelfile_path,
'filename': os.path.basename(modelfile_path),
'model_name': model_info['model_name'],
'hf_upstream': model_info.get('hf_url'),
'quantization': quantization or 'unspecified',
'sha256': model_info.get('sha256'),
'num_ctx': num_ctx or 0,
'family': family or 'Unknown',
'params': params or 'Unknown',
'capabilities': capabilities or []
}
except Exception as e:
return None
def get_all_modelfiles() -> List[Dict]:
"""Get all modelfiles from the modelfile-repo directory."""
modelfile_dir = app.config['MODELFILE_REPO']
if not os.path.exists(modelfile_dir):
return []
modelfiles = []
for filename in os.listdir(modelfile_dir):
if filename.endswith('.Modelfile'):
filepath = os.path.join(modelfile_dir, filename)
metadata = parse_modelfile_metadata(filepath)
if metadata:
modelfiles.append(metadata)
return modelfiles
def run_install_job(job_id: str, modelfile_path: str):
"""Run installation in background thread."""
with install_lock:
install_jobs[job_id]['status'] = 'running'
install_jobs[job_id]['progress'] = 'Starting installation...'
# Progress callback
def update_progress(message):
with install_lock:
install_jobs[job_id]['progress'] = message
# Cancellation callback
def should_cancel():
with install_lock:
return install_jobs[job_id].get('cancelled', False)
try:
success, skipped, model_name = hf_install_module.install_model(
modelfile_path,
dry_run=False,
skip_existing=False,
existing_models=None,
should_cancel=should_cancel,
progress_callback=update_progress
)
with install_lock:
if success:
install_jobs[job_id]['status'] = 'completed'
install_jobs[job_id]['model_name'] = model_name
install_jobs[job_id]['progress'] = f'Successfully installed {model_name}'
else:
install_jobs[job_id]['status'] = 'failed'
install_jobs[job_id]['error'] = f'Installation failed for {model_name}'
except InterruptedError as e:
with install_lock:
install_jobs[job_id]['status'] = 'cancelled'
install_jobs[job_id]['error'] = str(e)
except Exception as e:
with install_lock:
# Check if it was actually cancelled before marking as failed
if install_jobs[job_id].get('cancelled', False):
install_jobs[job_id]['status'] = 'cancelled'
install_jobs[job_id]['error'] = 'Installation cancelled by user'
else:
install_jobs[job_id]['status'] = 'failed'
install_jobs[job_id]['error'] = str(e)
def run_huggingface_install_job(job_id: str, model_name: str, modelfile_content: str, file_url: str, gguf_filename: str):
"""Run HuggingFace model installation in background thread."""
with install_lock:
install_jobs[job_id]['status'] = 'running'
install_jobs[job_id]['progress'] = 'Starting download...'
# Progress callback
def update_progress(message):
with install_lock:
install_jobs[job_id]['progress'] = message
# Cancellation callback
def should_cancel():
with install_lock:
return install_jobs[job_id].get('cancelled', False)
temp_gguf = None
temp_modelfile = None
try:
# Create temp files
import tempfile
temp_gguf = tempfile.NamedTemporaryFile(suffix='.gguf', delete=False)
temp_gguf.close()
gguf_path = temp_gguf.name
temp_modelfile = tempfile.NamedTemporaryFile(mode='w', suffix='.Modelfile', delete=False)
temp_modelfile.write(modelfile_content)
temp_modelfile.close()
modelfile_path = temp_modelfile.name
# Use existing download_file function with callbacks
hf_install_module.download_file(file_url, gguf_path, gguf_filename, should_cancel, update_progress)
# Use existing create_ollama_model function
hf_install_module.create_ollama_model(modelfile_path, gguf_path, model_name)
# Save Modelfile to repo
normalized_name = model_name.replace(':', '-')
final_modelfile_path = os.path.join(app.config['MODELFILE_REPO'], f"{normalized_name}.Modelfile")
os.makedirs(os.path.dirname(final_modelfile_path), exist_ok=True)
with open(final_modelfile_path, 'w') as f:
f.write(modelfile_content)
with install_lock:
install_jobs[job_id]['status'] = 'completed'
install_jobs[job_id]['model_name'] = model_name
install_jobs[job_id]['progress'] = f'Successfully created {model_name}'
except InterruptedError as e:
with install_lock:
install_jobs[job_id]['status'] = 'cancelled'
install_jobs[job_id]['error'] = 'Installation cancelled by user'
except Exception as e:
with install_lock:
if install_jobs[job_id].get('cancelled', False):
install_jobs[job_id]['status'] = 'cancelled'
install_jobs[job_id]['error'] = 'Installation cancelled by user'
else:
install_jobs[job_id]['status'] = 'failed'
install_jobs[job_id]['error'] = str(e)
finally:
# Clean up temp files
if temp_gguf and os.path.exists(temp_gguf.name):
os.unlink(temp_gguf.name)
if temp_modelfile and os.path.exists(temp_modelfile.name):
os.unlink(temp_modelfile.name)
# ===== WEB ROUTES =====
@app.route('/')
def index():
"""Main page."""
return render_template('index.html')
@app.route('/api/status')
def api_status():
"""Get real-time system status and running models."""
# Get system metrics
cpu_load, ram_used, ram_total = get_sys_metrics()
vram_used, vram_total, gpu_load = get_gpu_metrics()
# Get running models from /api/ps
running_models = []
try:
url = f"{get_ollama_url()}/api/ps"
with urllib.request.urlopen(url, timeout=2) as response:
ps_data = json.loads(response.read().decode())
for model in ps_data.get('models', []):
size_vram = model.get('size_vram', 0) / (1024**3) # GB
size_total = model.get('size', 0) / (1024**3) # GB
offload_pct = ((size_total - size_vram) / size_total * 100) if size_total > 0 else 0
running_models.append({
'name': model.get('name', 'Unknown'),
'size_gb': size_total,
'vram_gb': size_vram,
'offload_pct': offload_pct,
'expires_at': model.get('expires_at', '')
})
except Exception as e:
print(f"Error getting running models: {e}")
return jsonify({
'cpu_load': round(cpu_load, 2),
'ram_used_mb': ram_used,
'ram_total_mb': ram_total,
'ram_used_pct': round((ram_used / ram_total * 100) if ram_total > 0 else 0, 1),
'vram_used_mb': round(vram_used) if vram_used is not None else None,
'vram_total_mb': round(vram_total) if vram_total is not None else None,
'vram_used_pct': round((vram_used / vram_total * 100) if vram_total and vram_total > 0 else 0, 1),
'gpu_load': gpu_load,
'running_models': running_models
})
@app.route('/api/models')
def api_models():
"""Get list of all installed models and available modelfiles."""
try:
result = subprocess.run(['ollama', 'list'], capture_output=True, text=True, check=True)
installed_models = []
installed_names = set()
for line in result.stdout.strip().split('\n')[1:]: # Skip header
if line.strip():
parts = line.split()
if len(parts) >= 3:
name = parts[0]
installed_names.add(name)
model_id = parts[1] if len(parts) > 1 else ''
size = parts[2] if len(parts) > 2 else 'N/A'
# Check if modelfile exists
modelfile_path = check_modelfile_exists(name)
has_modelfile = modelfile_path is not None
# Get detailed info
detailed_info = get_model_info_detailed(name)
installed_models.append({
'name': name,
'id': model_id,
'size': size,
'installed': True,
'has_modelfile': has_modelfile,
'modelfile_path': modelfile_path,
'family': detailed_info['family'],
'params': detailed_info['params'],
'quant': detailed_info['quant'],
'max_context': detailed_info['max_context'],
'context_used': detailed_info['context_used'],
'capabilities': detailed_info['capabilities'],
'vram_estimate': detailed_info['vram_estimate']
})
# Get all modelfiles
all_modelfiles = get_all_modelfiles()
available_modelfiles = []
for mf in all_modelfiles:
# Check if this modelfile's model is already installed
if mf['model_name'] not in installed_names:
available_modelfiles.append({
'name': mf['model_name'],
'installed': False,
'has_modelfile': True,
'modelfile_path': mf['path'],
'hf_upstream': mf['hf_upstream'],
'quantization': mf['quantization'],
'family': mf.get('family', 'Unknown'),
'params': mf.get('params', 'Unknown'),
'quant': mf['quantization'],
'max_context': mf.get('num_ctx', 0),
'context_used': 0,
'capabilities': mf.get('capabilities', []),
'vram_estimate': 'N/A',
'size': 'Not installed'
})
# Combine installed and available
all_models = installed_models + available_modelfiles
return jsonify({
'models': all_models,
'installed_count': len(installed_models),
'available_count': len(available_modelfiles)
})
except subprocess.CalledProcessError as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/model/<path:model_name>')
def api_model_detail(model_name):
"""Get detailed information about a specific model."""
info = get_model_info_detailed(model_name)
modelfile_path = check_modelfile_exists(model_name)
return jsonify({
'info': info,
'has_modelfile': modelfile_path is not None,
'modelfile_path': modelfile_path
})
@app.route('/api/modelfile/<path:model_name>')
def api_get_modelfile(model_name):
"""Get the Modelfile content for a model."""
modelfile_path = check_modelfile_exists(model_name)
if not modelfile_path or not os.path.exists(modelfile_path):
return jsonify({'error': 'Modelfile not found'}), 404
try:
with open(modelfile_path, 'r') as f:
content = f.read()
return jsonify({
'path': modelfile_path,
'content': content
})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/modelfile/<path:model_name>', methods=['POST'])
def api_save_modelfile(model_name):
"""Save Modelfile content and optionally recreate the model."""
data = request.get_json()
content = data.get('content', '')
recreate_model = data.get('recreate_model', False)
if not content:
return jsonify({'error': 'No content provided'}), 400
# Determine the modelfile path
modelfile_path = check_modelfile_exists(model_name)
if not modelfile_path:
# Create new Modelfile
normalized_name = model_name.replace(':', '-')
modelfile_path = os.path.join(app.config['MODELFILE_REPO'], f"{normalized_name}.Modelfile")
try:
# Ensure directory exists
os.makedirs(os.path.dirname(modelfile_path), exist_ok=True)
# Save the modelfile
with open(modelfile_path, 'w') as f:
f.write(content)
# If there are changes, start background job to recreate the model
if recreate_model:
# Create job ID
job_id = f"recreate_{int(time.time() * 1000)}"
# Initialize job state
with install_lock:
install_jobs[job_id] = {
'status': 'queued',
'progress': 'Queued for recreation',
'modelfile_path': modelfile_path,
'model_name': model_name,
'error': None,
'cancelled': False
}
# Start background thread
thread = threading.Thread(target=run_install_job, args=(job_id, modelfile_path))
thread.daemon = True
thread.start()
return jsonify({
'success': True,
'path': modelfile_path,
'job_id': job_id,
'recreating': True
})
return jsonify({
'success': True,
'path': modelfile_path,
'recreating': False
})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/model/<path:model_name>', methods=['DELETE'])
def api_delete_model(model_name):
"""Delete a model."""
try:
subprocess.run(['ollama', 'rm', model_name], check=True, capture_output=True)
return jsonify({'success': True})
except subprocess.CalledProcessError as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/install/ollama', methods=['POST'])
def api_install_ollama_model():
"""Install a model from Ollama library."""
data = request.get_json()
model_name = data.get('model_name', '')
if not model_name:
return jsonify({'error': 'No model name provided'}), 400
try:
# Run ollama pull in background
process = subprocess.Popen(
['ollama', 'pull', model_name],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True
)
# Read output
stdout, stderr = process.communicate()
if process.returncode == 0:
return jsonify({
'success': True,
'message': f'Successfully pulled {model_name}',
'output': stdout
})
else:
return jsonify({
'error': f'Failed to pull model: {stderr}'
}), 500
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/install/huggingface', methods=['POST'])
def api_install_huggingface():
"""Process HuggingFace URL and return Modelfile skeleton or list of GGUF files."""
data = request.get_json()
hf_url = data.get('url', '')
selected_file = data.get('selected_file', None) # For when user selects from dropdown
if not hf_url:
return jsonify({'error': 'No URL provided'}), 400
try:
# Parse the URL
parsed = urlparse(hf_url)
path_parts = parsed.path.strip('/').split('/')
if len(path_parts) < 2:
return jsonify({'error': 'Invalid HuggingFace URL'}), 400
org = path_parts[0]
repo = path_parts[1]
# Check if it's a direct GGUF file link
if hf_url.endswith('.gguf') or '/blob/' in hf_url or '/resolve/' in hf_url:
# Direct GGUF file URL
gguf_filename = os.path.basename(parsed.path)
file_url = hf_url.replace('/blob/', '/resolve/')
return generate_modelfile_response(org, repo, gguf_filename, file_url)
elif selected_file:
# User selected a file from dropdown
file_url = f"https://huggingface.co/{org}/{repo}/resolve/main/{selected_file}"
return generate_modelfile_response(org, repo, selected_file, file_url)
else:
# Repository root - fetch available GGUF files
api_url = f"https://huggingface.co/api/models/{org}/{repo}"
with urllib.request.urlopen(api_url, timeout=10) as response:
model_data = json.loads(response.read().decode())
# Extract GGUF files from siblings
gguf_files = []
for sibling in model_data.get('siblings', []):
filename = sibling.get('rfilename', '')
if filename.lower().endswith('.gguf'):
size_bytes = sibling.get('size', 0)
size_gb = size_bytes / (1024**3) if size_bytes else 0
gguf_files.append({
'filename': filename,
'size': f"{size_gb:.2f} GB" if size_gb > 0 else "Unknown size"
})
if not gguf_files:
return jsonify({
'error': f'No GGUF files found in repository {org}/{repo}'
}), 404
# Return list of files for user to choose from
return jsonify({
'success': True,
'requires_selection': True,
'org': org,
'repo': repo,
'repo_url': f"https://huggingface.co/{org}/{repo}",
'gguf_files': gguf_files
})
except Exception as e:
return jsonify({'error': str(e)}), 500
def generate_modelfile_response(org: str, repo: str, gguf_filename: str, file_url: str):
"""Generate modelfile from GGUF filename using same logic as hf-llm-install.py."""
try:
# Use shared parsing function from hf-llm-install.py
model_base, tag, full_name = hf_install_module.parse_model_name_from_gguf(gguf_filename)
# Extract quantization for metadata
quant_match = re.search(r'[._-](Q[0-9]+_[KLM0-9]+(?:_[LSM])?)', gguf_filename, re.IGNORECASE)
quantization = quant_match.group(1).upper() if quant_match else 'unspecified'
# Create Modelfile skeleton with relative path (like CLI does)
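# Doubled braces in this f-string render as literal braces, so "{{{{ .System }}}}"
# emits "{{ .System }}" for Ollama's Go template syntax.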
modelfile_content = f"""# Modelfile for {full_name}
# hf_upstream: {file_url}
# quantization: {quantization}
# capabilities: tools
# sha256: <add_sha256_checksum_here>
FROM ./{gguf_filename}
# System prompt - customize for your use case
SYSTEM \"\"\"You are a helpful AI assistant.\"\"\"
# Parameters - refer to manufacturer's recommendations
# https://huggingface.co/{org}/{repo}
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|end|>"
PARAMETER stop "</s>"
# Template - adjust based on model's chat template
TEMPLATE \"\"\"{{{{ if .System }}}}<|im_start|>system
{{{{ .System }}}}<|im_end|>
{{{{ end }}}}{{{{ if .Prompt }}}}<|im_start|>user
{{{{ .Prompt }}}}<|im_end|>
{{{{ end }}}}<|im_start|>assistant
{{{{ .Response }}}}<|im_end|>
\"\"\"
"""
return jsonify({
'success': True,
'requires_selection': False,
'model_name': model_base,
'tag': tag,
'full_name': full_name,
'gguf_filename': gguf_filename,
'file_url': file_url,
'repo_url': f"https://huggingface.co/{org}/{repo}",
'modelfile_content': modelfile_content
})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/install/huggingface/create', methods=['POST'])
def api_create_from_modelfile():
"""Start HuggingFace model creation as background job."""
data = request.get_json()
model_name = data.get('model_name', '').strip()
modelfile_content = data.get('modelfile_content', '')
file_url = data.get('file_url', '')
gguf_filename = data.get('gguf_filename', '')
if not model_name or not modelfile_content or not file_url:
return jsonify({'error': 'Missing required parameters'}), 400
try:
# Create job ID
job_id = f"hf_install_{int(time.time() * 1000)}"
# Initialize job state
with install_lock:
install_jobs[job_id] = {
'status': 'queued',
'progress': 'Queued for download',
'model_name': model_name,
'error': None,
'cancelled': False
}
# Start background thread
thread = threading.Thread(
target=run_huggingface_install_job,
args=(job_id, model_name, modelfile_content, file_url, gguf_filename)
)
thread.daemon = True
thread.start()
return jsonify({
'success': True,
'job_id': job_id,
'message': 'Installation started'
})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/install/modelfile', methods=['POST'])
def api_install_from_modelfile():
"""Start installation of a model from an existing Modelfile as background job."""
try:
data = request.get_json()
modelfile_path = data.get('modelfile_path', '')
if not modelfile_path:
return jsonify({'error': 'No modelfile path provided'}), 400
if not os.path.exists(modelfile_path):
return jsonify({'error': 'Modelfile not found'}), 404
# Create job ID
job_id = f"install_{int(time.time() * 1000)}"
# Initialize job state
with install_lock:
install_jobs[job_id] = {
'status': 'queued',
'progress': 'Queued for installation',
'modelfile_path': modelfile_path,
'model_name': None,
'error': None,
'cancelled': False
}
# Start background thread
thread = threading.Thread(target=run_install_job, args=(job_id, modelfile_path))
thread.daemon = True
thread.start()
return jsonify({
'success': True,
'job_id': job_id,
'message': 'Installation started'
})
except Exception as e:
return jsonify({'error': str(e)}), 500
@app.route('/api/install/status/<job_id>', methods=['GET'])
def api_install_status(job_id):
"""Get status of an installation job."""
with install_lock:
if job_id not in install_jobs:
return jsonify({'error': 'Job not found'}), 404
job = install_jobs[job_id].copy()
return jsonify({
'status': job['status'],
'progress': job['progress'],
'model_name': job.get('model_name'),
'error': job.get('error')
})
@app.route('/api/install/active', methods=['GET'])
def api_install_active():
"""Get all active (running or queued) installation jobs."""
with install_lock:
active = {}
for job_id, job in install_jobs.items():
if job['status'] in ['queued', 'running']:
active[job_id] = {
'status': job['status'],
'modelfile_path': job.get('modelfile_path'),
'model_name': job.get('model_name')
}
return jsonify(active)
@app.route('/api/install/cancel/<job_id>', methods=['POST'])
def api_install_cancel(job_id):
"""Cancel an installation job."""
with install_lock:
if job_id not in install_jobs:
return jsonify({'error': 'Job not found'}), 404
if install_jobs[job_id]['status'] in ['completed', 'failed', 'cancelled']:
return jsonify({'error': 'Job already finished'}), 400
install_jobs[job_id]['cancelled'] = True
return jsonify({'success': True})
@app.route('/api/performance/vram-test/<path:model_name>', methods=['POST'])
def api_vram_test(model_name):
"""Test VRAM usage for a specific model or all models."""
try:
# Check if testing all models
if model_name == '_all_':
# Get all installed models
result = subprocess.run(['ollama', 'list'], capture_output=True, text=True, check=True)
models_to_test = []
for line in result.stdout.strip().split('\n')[1:]:
if line.strip():
parts = line.split()
if len(parts) >= 1:
models_to_test.append(parts[0])
# Test each model
results = []
for model in models_to_test:
result = test_single_model_vram(model)
results.append(result)
time.sleep(0.5) # Brief pause between tests
return jsonify({
'success': True,
'results': results
})
else:
# Test single model
result = test_single_model_vram(model_name)
return jsonify(result)
except Exception as e:
return jsonify({'error': str(e)}), 500
def test_single_model_vram(model_name: str) -> Dict:
"""Test VRAM usage for a single model. Uses vram-test.py logic."""
# Use the existing test_model_vram function from vram-test.py
return vram_test_module.test_model_vram(model_name)
@app.route('/api/performance/optimize/<path:model_name>', methods=['POST'])
def api_optimize_context(model_name):
"""Run context optimizer for a specific model. Uses context-optimizer.py logic."""
try:
# Get parameters from request
data = request.get_json() or {}
overhead_gb = float(data.get('overhead_gb', 1.0))
max_turns = int(data.get('max_turns', 20))
# Use the existing find_optimal_context function from context-optimizer.py
result = context_optimizer_module.find_optimal_context(model_name, max_turns=max_turns, overhead_gb=overhead_gb)
if not result or 'results' not in result:
return jsonify({
'success': False,
'error': 'Optimization failed or no results returned'
})
# Extract data from results
test_results = []
optimal_context = 0
for r in result.get('results', []):
test_results.append({
'context_size': r.get('num_ctx', 0),
'vram_gb': round(r.get('vram_gb', 0), 2),
'offload_pct': round(r.get('offload_pct', 0), 1),
'fits': r.get('offload_pct', 100) == 0
})
# Track optimal (largest that fits)
if r.get('offload_pct', 100) == 0:
optimal_context = max(optimal_context, r.get('num_ctx', 0))
# Get VRAM info
vram_total, vram_available = context_optimizer_module.get_gpu_vram()
return jsonify({
'success': True,
'model': model_name,
'max_context': result.get('max_context', 0),
'current_context': result.get('current_ctx', 0),
'optimal_context': result.get('recommended_ctx', optimal_context),
'available_vram_gb': round(vram_available, 2) if vram_available else 0,
'results': test_results,
'summary': result.get('summary', '')
})
except Exception as e:
return jsonify({
'success': False,
'error': str(e)
}), 500
if __name__ == '__main__':
print("Starting Ollama-Utils Web Interface...")
print("Access the interface at: http://localhost:5000")
app.run(host='0.0.0.0', port=5000, debug=True)