rerank endpoint plugin

2026-01-20 22:01:23 +01:00
parent 6c7f96145b
commit 8149ac8c8b
3 changed files with 119 additions and 53 deletions


@@ -1,4 +1,4 @@
# Ollama Cross-Encoder Reranker Workaround
> **⚠️ Important:** This is a **workaround/hack**, not a proper solution. It exploits an undocumented behavior of embedding magnitudes and should be used with caution.
@@ -6,12 +6,14 @@ A FastAPI service that provides document reranking using Ollama's embedding endp
## The Problem
Cross-encoder reranker models (like BGE-reranker, Qwen3-Reranker, etc.) are designed to score query-document pairs for relevance. However:
- **Ollama has no `/api/rerank` endpoint** - reranker models can't be used as intended
- **`/api/embeddings`** - returns embeddings, not the classification head scores
- **`/api/generate`** - reranker models can't generate meaningful text (prompting them yields uniform outputs like a constant 0.5)
**Root Cause:** Cross-encoder models have a classification head that outputs relevance scores. Ollama only exposes the embedding layer, not the classification layer.
## The Workaround
This service uses a magnitude-based approach:
@@ -19,60 +21,92 @@ This service uses a magnitude-based approach:
1. Concatenates query and document in cross-encoder format: `"Query: {query}\n\nDocument: {doc}\n\nRelevance:"`
2. Gets embedding vector from Ollama's `/api/embeddings` endpoint
3. Calculates the L2 norm (magnitude) of the embedding vector
4. **Key discovery:** For the cross-encoder models tested, **lower magnitude = more relevant**
5. Inverts and normalizes to 0-1 range where higher score = more relevant
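A minimal sketch of these steps, assuming a local Ollama instance at `http://localhost:11434` and the calibration values shown later in this README (the service's actual `score_document_cross_encoder_workaround()` may differ in detail):

```python
import math
import requests

OLLAMA_EMBEDDINGS_URL = "http://localhost:11434/api/embeddings"  # assumes default Ollama port
MODEL = "qllama/bge-reranker-v2-m3"

# Model-specific calibration values -- MUST be re-measured per model (see Calibration below)
TYPICAL_GOOD_MAGNITUDE = 15.0   # observed for highly relevant documents
TYPICAL_POOR_MAGNITUDE = 25.0   # observed for irrelevant documents

def score_document_cross_encoder_workaround(query: str, doc: str) -> float:
    # Step 1: concatenate query and document in cross-encoder format
    prompt = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"
    # Step 2: get the embedding vector from Ollama
    resp = requests.post(OLLAMA_EMBEDDINGS_URL, json={"model": MODEL, "prompt": prompt})
    resp.raise_for_status()
    embedding = resp.json()["embedding"]
    # Step 3: L2 norm (magnitude) of the embedding vector
    magnitude = math.sqrt(sum(x * x for x in embedding))
    # Steps 4-5: invert (lower magnitude = more relevant) and clamp to the 0-1 range
    score = (TYPICAL_POOR_MAGNITUDE - magnitude) / (TYPICAL_POOR_MAGNITUDE - TYPICAL_GOOD_MAGNITUDE)
    return min(max(score, 0.0), 1.0)
```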
### Why This Works (Sort Of)
When a cross-encoder model processes a query-document pair through the embedding endpoint, the embedding's magnitude appears to correlate inversely with relevance. This pattern has been observed in:
- BGE-reranker models (BGE-reranker-v2-m3, etc.)
- Qwen3-Reranker models (Qwen3-Reranker-4B, etc.)
- Potentially other cross-encoder architectures
**However, this is:**
- **Not documented behavior** - exploiting accidental correlation
- **Not guaranteed across all models** - each model may have different magnitude ranges
- **Not the intended use** - bypasses the classification head
- **Less accurate** - proper cross-encoder scoring would be significantly better
But it's currently the only way to use cross-encoder reranker models with Ollama.
## Limitations
### ⚠️ Critical Limitations
1. **Bypasses Classification Head**
- Cross-encoder models have a specialized classification layer for scoring
- Ollama only exposes the embedding layer, not the classification head
- We're using embedding magnitudes as a proxy, not the actual relevance scores
- **This is fundamentally wrong** - we're using the wrong layer of the model
2. **Model-Specific Behavior**
- Magnitude ranges differ between models:
- BGE-reranker-v2-m3: ~15-28 (lower = more relevant)
- Qwen3-Reranker: similar pattern observed
- Other models: unknown, requires testing
- Correlation direction may theoretically vary (though inverse correlation seems common)
- Requires manual calibration per model family
3. **No Theoretical Foundation**
- Exploits accidental correlation, not designed functionality
- No documentation or guarantees from model creators
- Could break with model updates or quantization changes
- No mathematical proof this approach is valid
4. **Significantly Less Accurate**
- Proper cross-encoder classification head scoring would be far more accurate
- sentence-transformers library uses the models correctly (30-50% better accuracy expected)
- This workaround is a compromise for Ollama's GPU scheduling benefits
- **Not suitable for production** without extensive validation
5. **Embedding Dimension Dependency**
- Magnitude scales with dimensionality (384 vs 768 vs 1024)
- Models with different dimensions need different calibration
- Quantization (Q4 vs Q5 vs Q8) may affect magnitude distributions
6. **Performance Overhead**
- Requires one API call per document (40 docs = 40 calls)
- Slower than native reranking API would be
- Concurrent processing helps but still suboptimal
- No batching support in Ollama's embedding API
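Since the embedding API accepts one prompt at a time, issuing calls concurrently is the main mitigation. A hedged sketch using a thread pool around the single-document scorer sketched earlier (the service itself may use async I/O instead):

```python
from concurrent.futures import ThreadPoolExecutor

def rerank(query: str, documents: list[str], max_workers: int = 8) -> list[tuple[str, float]]:
    # One /api/embeddings call per document, issued in parallel
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda doc: score_document_cross_encoder_workaround(query, doc),
                               documents))
    # Highest score (= lowest magnitude) first
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
```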
## When To Use This
**Use if:**
- You need Ollama's GPU scheduling for multiple models
- VRAM is constrained and you can't run separate services
- You're okay with **significantly reduced accuracy** vs proper cross-encoder usage
- You can tolerate model-specific calibration and testing
- You understand you're using the **wrong layer** of the model
- This is for experimentation, not production
**Don't use if:**
- You need reliable, production-grade reranking
- You need cross-model consistency
- You have VRAM for sentence-transformers (~200MB for the reranker alone)
- Accuracy is critical for your use case
- You need guaranteed correctness
- You're deploying to production without extensive validation
### Recommended Alternative
For production use, run sentence-transformers separately:
```python
from sentence_transformers import CrossEncoder
model = CrossEncoder('BAAI/bge-reranker-v2-m3')
scores = model.predict([(query, doc) for doc in documents])
```
This uses the classification head correctly and provides proper relevance scores.
## Installation
@@ -88,8 +122,11 @@ source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Ensure Ollama is running with a cross-encoder reranker model
# Examples:
ollama pull qllama/bge-reranker-v2-m3
# or
ollama pull dengcao/qwen3-reranker-4b
```
## Usage
@@ -149,8 +186,10 @@ The most critical parameters are in `score_document_cross_encoder_workaround()`:
typical_good_magnitude = 15.0 # Highly relevant documents
typical_poor_magnitude = 25.0 # Irrelevant documents
# For cross-encoder models (BGE, Qwen3-Reranker):
# Observed range: ~15-28
# Lower magnitude = more relevant (inverse correlation)
# MUST be calibrated per model family!
```
### How to Calibrate for a New Model
@@ -238,7 +277,7 @@ score = min(max(score, 0.0), 1.0)
### Example Magnitude Distributions
From real queries to **BGE-reranker-v2-m3** (your results may vary with other models):
```
Query: "Was ist eine Catalog Node ID?"
@@ -249,6 +288,8 @@ Weakly relevant: magnitude ~20.00 - 24.00 → score 0.20-0.50
Irrelevant: magnitude ~25.00 - 28.00 → score 0.00-0.10
```
**Note:** Qwen3-Reranker and other cross-encoder models will have different ranges. Always calibrate!
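A quick way to probe a new model's range before trusting any scores (hypothetical helper; labeled relevant/irrelevant documents of your own are assumed):

```python
import math
import requests

def magnitude(model: str, query: str, doc: str) -> float:
    # Same prompt format and embedding call as the scorer above, but returns the raw L2 norm
    prompt = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"
    resp = requests.post("http://localhost:11434/api/embeddings",
                         json={"model": model, "prompt": prompt})
    resp.raise_for_status()
    return math.sqrt(sum(x * x for x in resp.json()["embedding"]))

# Print raw magnitudes for labeled pairs, then set typical_good_magnitude /
# typical_poor_magnitude from what you observe.
query = "What is a Catalog Node ID?"                    # example query
labeled_docs = [("known relevant doc text ...", "relevant"),
                ("known unrelated doc text ...", "irrelevant")]
for doc, label in labeled_docs:
    print(f"{label:>10}: {magnitude('dengcao/qwen3-reranker-4b', query, doc):.2f}")
```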
## Alternatives
### 1. Use sentence-transformers (Recommended for Production)
@@ -297,10 +338,15 @@ MIT
## Disclaimer
This is an **experimental workaround** that exploits undocumented behavior and **uses the wrong layer of cross-encoder models**. It is:
- **Using embeddings instead of classification scores** - fundamentally incorrect approach
- Not endorsed by Ollama, BAAI, Alibaba (Qwen), or any model creator
- Not guaranteed to work across models, versions, or quantization levels
- Not suitable for production use without extensive testing and validation
- A temporary hack until Ollama adds native `/api/rerank` support
- Significantly less accurate than proper cross-encoder usage
**Use at your own risk and always validate results against ground truth.**
For production systems, use sentence-transformers or dedicated reranking APIs that access the classification head properly.