# Ollama Reranker Workaround

⚠️ **Important:** This is a workaround/hack, not a proper solution. It exploits an undocumented behavior of embedding magnitudes and should be used with caution.

A FastAPI service that provides document reranking using Ollama's embedding endpoint. It exists because Ollama does not natively support a `/api/rerank` endpoint for cross-encoder reranker models.
## The Problem
Cross-encoder reranker models (like BGE-reranker-v2-m3) are designed to score query-document pairs for relevance. However:
- Ollama has no `/api/rerank` endpoint - reranker models can't be used as intended
- `/api/embeddings` returns embeddings, not classification scores
- `/api/generate` - reranker models can't generate text (they output uniform scores like 0.5)
## The Workaround
This service uses a magnitude-based approach:
1. Concatenates query and document in cross-encoder format: `"Query: {query}\n\nDocument: {doc}\n\nRelevance:"`
2. Gets an embedding vector from Ollama's `/api/embeddings` endpoint
3. Calculates the L2 norm (magnitude) of the embedding vector
4. **Key discovery:** for BGE-reranker-v2-m3, lower magnitude = more relevant
5. Inverts and normalizes to a 0-1 range where a higher score = more relevant
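A minimal sketch of that pipeline, assuming the default Ollama endpoint and illustrative names (the service's actual implementation lives in `score_document_cross_encoder_workaround()`, see Configuration & Tunables below):

```python
import httpx
import numpy as np

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint (assumption)

async def score_pair(client: httpx.AsyncClient, model: str, query: str, doc: str,
                     good_mag: float = 15.0, poor_mag: float = 25.0) -> float:
    """Magnitude-based relevance score: lower magnitude -> higher score (BGE-style)."""
    combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"

    # One embedding call per query-document pair
    resp = await client.post(f"{OLLAMA_URL}/api/embeddings",
                             json={"model": model, "prompt": combined})
    resp.raise_for_status()
    embedding = resp.json()["embedding"]

    # L2 norm of the embedding vector
    magnitude = float(np.linalg.norm(np.array(embedding)))

    # Invert and normalize: good_mag maps to 1.0, poor_mag maps to 0.0, then clamp
    score = (poor_mag - magnitude) / (poor_mag - good_mag)
    return min(max(score, 0.0), 1.0)
```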
## Why This Works (Sort Of)
When a cross-encoder model processes a query-document pair through the embedding endpoint, the embedding's magnitude appears to correlate with relevance for some models. This is:
- Not documented behavior
- Not guaranteed across models
- Not the intended use of the embedding endpoint
- Less accurate than proper cross-encoder scoring
But it's the only way to use reranker models with Ollama right now.
## Limitations

⚠️ **Critical limitations:**
1. **Model-Specific Behavior**
   - Magnitude ranges differ between models (BGE: 15-28, others: unknown)
   - Correlation direction may vary (lower/higher = more relevant)
   - Requires manual calibration per model

2. **No Theoretical Foundation**
   - Exploits accidental behavior, not designed functionality
   - Could break with model updates
   - No guarantee of correctness

3. **Less Accurate Than Proper Methods**
   - Native cross-encoder scoring is more accurate
   - The sentence-transformers library is the gold standard
   - This is a compromise made for GPU scheduling benefits

4. **Embedding Dimension Dependency**
   - Magnitude scales with dimensionality (384 vs 768 vs 1024)
   - Models with different dimensions need different calibration

5. **Performance**
   - Requires one API call per document (40 docs = 40 calls)
   - Slower than native reranking would be
   - Fast in practice, but not optimal
## When To Use This

✅ **Use if:**
- You need Ollama's GPU scheduling for multiple models
- VRAM is constrained and you can't run separate services
- You're okay with reduced accuracy vs sentence-transformers
- You can tolerate model-specific calibration
❌ **Don't use if:**
- You need reliable, production-grade reranking
- You need cross-model consistency
- You have VRAM for sentence-transformers (~200MB for reranker only)
- Accuracy is critical
## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/ollama-reranker-workaround.git
cd ollama-reranker-workaround

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Ensure Ollama is running with a reranker model
ollama pull qllama/bge-reranker-v2-m3
```
## Usage

### Start the Service

```bash
python api.py
```

The service runs on `http://0.0.0.0:8080`.

### API Request
```bash
curl -X POST http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qllama/bge-reranker-v2-m3:latest",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence.",
      "The weather today is sunny.",
      "Neural networks are used in deep learning."
    ],
    "top_n": 2
  }'
```
### Response

```json
{
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9234,
      "document": "Machine learning is a subset of artificial intelligence."
    },
    {
      "index": 2,
      "relevance_score": 0.7845,
      "document": "Neural networks are used in deep learning."
    }
  ]
}
```
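The same request can be issued from Python with `httpx` (already listed in the requirements); the payload and response shape match the curl example above:

```python
import httpx

payload = {
    "model": "qllama/bge-reranker-v2-m3:latest",
    "query": "What is machine learning?",
    "documents": [
        "Machine learning is a subset of artificial intelligence.",
        "The weather today is sunny.",
        "Neural networks are used in deep learning.",
    ],
    "top_n": 2,
}

resp = httpx.post("http://localhost:8080/v1/rerank", json=payload, timeout=60.0)
resp.raise_for_status()
for result in resp.json()["results"]:
    print(f"{result['relevance_score']:.4f}  {result['document']}")
```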
## Configuration & Tunables

### Model Calibration

The most critical parameters are in `score_document_cross_encoder_workaround()`:

```python
# Magnitude bounds (model-specific!)
typical_good_magnitude = 15.0  # Highly relevant documents
typical_poor_magnitude = 25.0  # Irrelevant documents

# For BGE-reranker-v2-m3, the observed range is ~15-28
# Lower magnitude = more relevant (inverted correlation)
```
### How to Calibrate for a New Model

1. Enable magnitude logging:

   ```python
   logger.info(f"Raw magnitude: {magnitude:.2f}")
   ```

2. Test with known relevant/irrelevant documents:

   ```python
   # Send queries with obviously relevant and irrelevant docs
   # Observe magnitude ranges in the logs
   ```

3. Determine the correlation direction:

   - If relevant docs have lower magnitudes → set `invert = True`
   - If relevant docs have higher magnitudes → set `invert = False`

4. Set the bounds:

   ```python
   # Find the 90th percentile of relevant doc magnitudes
   typical_good_magnitude = <observed_value>

   # Find the 10th percentile of irrelevant doc magnitudes
   typical_poor_magnitude = <observed_value>
   ```
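Steps 2-4 can be roughed out with a small calibration script along these lines (a sketch, assuming the default Ollama endpoint and a hand-labelled set of relevant/irrelevant documents; function names are illustrative, not taken from `api.py`):

```python
import asyncio
import httpx
import numpy as np

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint (assumption)

async def raw_magnitude(client: httpx.AsyncClient, model: str, query: str, doc: str) -> float:
    """Embed the query-document pair and return the raw L2 magnitude."""
    combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"
    resp = await client.post(f"{OLLAMA_URL}/api/embeddings",
                             json={"model": model, "prompt": combined})
    resp.raise_for_status()
    return float(np.linalg.norm(resp.json()["embedding"]))

async def calibrate(model: str, query: str, relevant_docs: list[str], irrelevant_docs: list[str]):
    async with httpx.AsyncClient(timeout=60.0) as client:
        rel = [await raw_magnitude(client, model, query, d) for d in relevant_docs]
        irr = [await raw_magnitude(client, model, query, d) for d in irrelevant_docs]

    # Step 3: correlation direction
    invert = np.median(rel) < np.median(irr)  # lower magnitude = more relevant?

    # Step 4: bounds (90th percentile of relevant, 10th percentile of irrelevant)
    good = float(np.percentile(rel, 90))
    poor = float(np.percentile(irr, 10))
    print(f"invert={invert}, typical_good_magnitude={good:.2f}, typical_poor_magnitude={poor:.2f}")

# asyncio.run(calibrate("qllama/bge-reranker-v2-m3:latest", query, relevant_docs, irrelevant_docs))
```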
### Prompt Format Tuning

The concatenation format may affect results:

```python
# Current format (works for BGE-reranker-v2-m3)
combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"

# Alternative formats to try:
combined = f"{query} [SEP] {doc}"
combined = f"query: {query} document: {doc}"
combined = f"<query>{query}</query><document>{doc}</document>"
```
Test different formats and check whether the score distributions separate relevant and irrelevant documents more cleanly.
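One rough way to run that comparison (a sketch assuming the default Ollama endpoint; `embed_magnitude` and `compare_formats` are illustrative helpers, not functions from `api.py`): score the same relevant and irrelevant documents under each format and look for the widest magnitude gap.

```python
import httpx
import numpy as np

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint (assumption)

FORMATS = {
    "default": "Query: {q}\n\nDocument: {d}\n\nRelevance:",
    "sep": "{q} [SEP] {d}",
    "plain": "query: {q} document: {d}",
    "xml": "<query>{q}</query><document>{d}</document>",
}

async def embed_magnitude(client: httpx.AsyncClient, model: str, text: str) -> float:
    resp = await client.post(f"{OLLAMA_URL}/api/embeddings",
                             json={"model": model, "prompt": text})
    resp.raise_for_status()
    return float(np.linalg.norm(resp.json()["embedding"]))

async def compare_formats(model: str, query: str, relevant_doc: str, irrelevant_doc: str):
    async with httpx.AsyncClient(timeout=60.0) as client:
        for name, fmt in FORMATS.items():
            rel = await embed_magnitude(client, model, fmt.format(q=query, d=relevant_doc))
            irr = await embed_magnitude(client, model, fmt.format(q=query, d=irrelevant_doc))
            # A wider magnitude gap between relevant and irrelevant suggests better separation
            print(f"{name:8s} relevant={rel:.2f} irrelevant={irr:.2f} gap={abs(irr - rel):.2f}")
```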
### Concurrency Settings

```python
# In the rerank() endpoint

# Process all documents concurrently (default)
tasks = [score_document(...) for doc in documents]
results = await asyncio.gather(*tasks)

# Or batch the calls for rate limiting:
batch_size = 10
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    batch_tasks = [score_document(...) for doc in batch]
    results.extend(await asyncio.gather(*batch_tasks))
```
## Technical Details

### Magnitude Calculation

```python
import numpy as np

# Get embedding from Ollama
embedding = await get_embedding(client, model, combined_text)

# Calculate the L2 norm (Euclidean length)
vec = np.array(embedding)
magnitude = float(np.linalg.norm(vec))

# magnitude = sqrt(sum(x_i^2 for all dimensions))
```
### Score Normalization

```python
# Linear interpolation (inverted for BGE-reranker-v2-m3)
score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)

# Clamp to [0, 1]
score = min(max(score, 0.0), 1.0)
```
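Plugging in the BGE bounds from the calibration section, a magnitude of 17.0 maps to a score of 0.8:

```python
typical_good_magnitude = 15.0
typical_poor_magnitude = 25.0
magnitude = 17.0  # example value inside the observed BGE range

score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)
# (25.0 - 17.0) / (25.0 - 15.0) = 0.8, already within [0, 1] so clamping leaves it unchanged
```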
### Example Magnitude Distributions

From real queries to BGE-reranker-v2-m3 (query: "Was ist eine Catalog Node ID?"):

| Relevance           | Magnitude       | Score       |
|---------------------|-----------------|-------------|
| Highly relevant     | ~15.30 - 15.98  | 0.95 - 0.97 |
| Moderately relevant | ~17.00 - 19.00  | 0.70 - 0.85 |
| Weakly relevant     | ~20.00 - 24.00  | 0.20 - 0.50 |
| Irrelevant          | ~25.00 - 28.00  | 0.00 - 0.10 |
## Alternatives

### 1. Use sentence-transformers (Recommended for Production)

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('BAAI/bge-reranker-v2-m3', device='cpu')
scores = model.predict([(query, doc) for doc in documents])
```
**Pros:** Accurate, reliable, proper implementation
**Cons:** ~200MB VRAM/RAM, separate from Ollama
### 2. Request an Ollama Feature

Open an issue on Ollama's GitHub requesting native `/api/rerank` support.

### 3. Use API Services

Services like Cohere, Jina AI, or Voyage AI offer reranking APIs.
## Requirements

```text
fastapi>=0.104.0
uvicorn>=0.24.0
httpx>=0.25.0
pydantic>=2.0.0
numpy>=1.24.0
```
## Contributing

This is a workaround for a missing feature. Contributions are welcome for:
- Calibration configs for additional models
- Auto-calibration logic
- Alternative prompt formats
- Better normalization strategies
But remember: The best contribution would be native Ollama support.
## License
MIT
## Disclaimer
This is an experimental workaround that exploits undocumented behavior. It is:
- Not endorsed by Ollama or BAAI
- Not guaranteed to work across models or versions
- Not suitable for production use without extensive testing
- A temporary solution until native reranking support exists
Use at your own risk and always validate results against ground truth.