add rerank endpoint
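
Adds a FastAPI service under `plugins/reranking-endpoint` that exposes `POST /v1/rerank` and scores each query-document pair by the L2 norm of its Ollama embedding.

For reviewers, the scoring step in `api.py` reduces to a linear map from embedding magnitude to a clamped 0-1 score. Below is a minimal sketch of that mapping, assuming the 15.0/25.0 bounds hard-coded in this commit. `magnitude_to_score` is a hypothetical helper name used only for illustration, and the sketch uses the standard library instead of NumPy so it runs standalone:

```python
import math
from typing import Sequence


def magnitude_to_score(
    embedding: Sequence[float],
    good: float = 15.0,   # assumed bound for highly relevant docs (from api.py)
    poor: float = 25.0,   # assumed bound for irrelevant docs (from api.py)
) -> float:
    """Map an embedding's L2 norm to a 0-1 score where lower norm = higher score."""
    # L2 norm: square root of the sum of squared components
    magnitude = math.sqrt(sum(x * x for x in embedding))
    # Inverted linear interpolation between the two bounds
    score = (poor - magnitude) / (poor - good)
    # Clamp to [0, 1]
    return min(max(score, 0.0), 1.0)
```

With these bounds a magnitude of 20 lands at 0.5, and anything at or above 25 is clamped to 0.0, so irrelevant documents with magnitudes around 28 all tie at the bottom.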

2026-01-20 20:44:48 +00:00
parent ccbe95ac1e
commit 6c7f96145b
3 changed files with 497 additions and 0 deletions
--- a/plugins/reranking-endpoint/README.md
+++ b/plugins/reranking-endpoint/README.md
@@ -0,0 +1,306 @@
+# Ollama Reranker Workaround
+
+> **⚠️ Important:** This is a **workaround/hack**, not a proper solution. It exploits an undocumented behavior of embedding magnitudes and should be used with caution.
+
+A FastAPI service that provides document reranking using Ollama's embedding endpoint. This exists because Ollama does not natively support a `/api/rerank` endpoint for cross-encoder reranker models.
+
+## The Problem
+
+Cross-encoder reranker models (like BGE-reranker-v2-m3) are designed to score query-document pairs for relevance. However:
+
+- **Ollama has no `/api/rerank` endpoint** - reranker models can't be used as intended
+- **`/api/embeddings`** - returns embeddings, not classification scores
+- **`/api/generate`** - reranker models can't generate text (they output uniform scores like 0.5)
+
+## The Workaround
+
+This service uses a magnitude-based approach:
+
+1. Concatenates query and document in cross-encoder format: `"Query: {query}\n\nDocument: {doc}\n\nRelevance:"`
+2. Gets embedding vector from Ollama's `/api/embeddings` endpoint
+3. Calculates the L2 norm (magnitude) of the embedding vector
+4. **Key discovery:** For BGE-reranker-v2-m3, **lower magnitude = more relevant**
+5. Inverts and normalizes to 0-1 range where higher score = more relevant
+
+### Why This Works (Sort Of)
+
+When a cross-encoder model processes a query-document pair through the embedding endpoint, the embedding's magnitude appears to correlate with relevance for some models. This is:
+- **Not documented behavior**
+- **Not guaranteed across models**
+- **Not the intended use of the embedding endpoint**
+- **Less accurate than proper cross-encoder scoring**
+
+But it's the only way to use reranker models with Ollama right now.
+
+## Limitations
+
+### ⚠️ Critical Limitations
+
+1. **Model-Specific Behavior**
+   - Magnitude ranges differ between models (BGE: 15-28, others: unknown)
+   - Correlation direction may vary (lower/higher = more relevant)
+   - Requires manual calibration per model
+
+2. **No Theoretical Foundation**
+   - Exploits accidental behavior, not designed functionality
+   - Could break with model updates
+   - No guarantee of correctness
+
+3. **Less Accurate Than Proper Methods**
+   - Native cross-encoder scoring is more accurate
+   - sentence-transformers library is the gold standard
+   - This is a compromise for GPU scheduling benefits
+
+4. **Embedding Dimension Dependency**
+   - Magnitude scales with dimensionality (384 vs 768 vs 1024)
+   - Models with different dimensions need different calibration
+
+5. **Performance**
+   - Requires one API call per document (40 docs = 40 calls)
+   - Slower than native reranking would be
+   - Fast enough for small candidate sets, but not optimal
+
+## When To Use This
+
+✅ **Use if:**
+- You need Ollama's GPU scheduling for multiple models
+- VRAM is constrained and you can't run separate services
+- You're okay with reduced accuracy vs sentence-transformers
+- You can tolerate model-specific calibration
+
+❌ **Don't use if:**
+- You need reliable, production-grade reranking
+- You need cross-model consistency
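+- You have memory to run the reranker through sentence-transformers (BGE-reranker-v2-m3 has roughly 568M parameters, so expect on the order of 1-2 GB depending on precision)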
+- Accuracy is critical
+
+## Installation
+
+```bash
+# Clone the repository
+git clone https://github.com/yourusername/ollama-reranker-workaround.git
+cd ollama-reranker-workaround
+
+# Create virtual environment
+python3 -m venv .venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Ensure Ollama is running with a reranker model
+ollama pull qllama/bge-reranker-v2-m3
+```
+
+## Usage
+
+### Start the Service
+
+```bash
+python api.py
+```
+
+The service listens on `0.0.0.0:8080` (all interfaces); from the same machine, reach it at `http://localhost:8080`.
+
+### API Request
+
+```bash
+curl -X POST http://localhost:8080/v1/rerank \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "qllama/bge-reranker-v2-m3:latest",
+    "query": "What is machine learning?",
+    "documents": [
+      "Machine learning is a subset of artificial intelligence.",
+      "The weather today is sunny.",
+      "Neural networks are used in deep learning."
+    ],
+    "top_n": 2
+  }'
+```
+
+### Response
+
+```json
+{
+  "results": [
+    {
+      "index": 0,
+      "relevance_score": 0.9234,
+      "document": "Machine learning is a subset of artificial intelligence."
+    },
+    {
+      "index": 2,
+      "relevance_score": 0.7845,
+      "document": "Neural networks are used in deep learning."
+    }
+  ]
+}
+```
+
+## Configuration & Tunables
+
+### Model Calibration
+
+The most critical parameters are in `score_document_cross_encoder_workaround()`:
+
+```python
+# Magnitude bounds (model-specific!)
+typical_good_magnitude = 15.0   # Highly relevant documents
+typical_poor_magnitude = 25.0   # Irrelevant documents
+
+# For BGE-reranker-v2-m3, observed range is ~15-28
+# Lower magnitude = more relevant (inverted correlation)
+```
+
+### How to Calibrate for a New Model
+
+1. **Enable magnitude logging:**
+   ```python
+   logger.info(f"Raw magnitude: {magnitude:.2f}")
+   ```
+
+2. **Test with known relevant/irrelevant documents:**
+   ```python
+   # Send queries with obviously relevant and irrelevant docs
+   # Observe magnitude ranges in logs
+   ```
+
+3. **Determine correlation direction:**
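+   - If relevant docs have **lower** magnitudes → keep `typical_good_magnitude` below `typical_poor_magnitude` (the current BGE setup)
+   - If relevant docs have **higher** magnitudes → set `typical_good_magnitude` above `typical_poor_magnitude`; the same interpolation formula then maps higher magnitudes to higher scores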
+
+4. **Set bounds:**
+   ```python
+   # Find 90th percentile of relevant doc magnitudes
+   typical_good_magnitude = <observed_value>
+   
+   # Find 10th percentile of irrelevant doc magnitudes  
+   typical_poor_magnitude = <observed_value>
+   ```
+
+### Prompt Format Tuning
+
+The concatenation format may affect results:
+
+```python
+# Current format (works for BGE-reranker-v2-m3)
+combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"
+
+# Alternative formats to try:
+combined = f"{query} [SEP] {doc}"
+combined = f"query: {query} document: {doc}"
+combined = f"<query>{query}</query><document>{doc}</document>"
+```
+
+Test different formats and check if score distributions improve.
+
+### Concurrency Settings
+
+```python
+# In the rerank() endpoint
+# Process all documents concurrently (default)
+tasks = [score_document(...) for doc in documents]
+results = await asyncio.gather(*tasks)
+
+# Or batch for rate limiting:
+batch_size = 10
+for i in range(0, len(documents), batch_size):
+    batch = documents[i:i+batch_size]
+    # process batch...
+```
+
+## Technical Details
+
+### Magnitude Calculation
+
+```python
+import numpy as np
+
+# Get embedding from Ollama
+embedding = await get_embedding(client, model, combined_text)
+
+# Calculate L2 norm (Euclidean length)
+vec = np.array(embedding)
+magnitude = float(np.linalg.norm(vec))
+# magnitude = sqrt(sum(x_i^2 for all dimensions))
+```
+
+### Score Normalization
+
+```python
+# Linear interpolation (inverted for BGE-reranker-v2-m3)
+score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)
+
+# Clamp to [0, 1]
+score = min(max(score, 0.0), 1.0)
+```
+
+### Example Magnitude Distributions
+
+From real queries to BGE-reranker-v2-m3:
+
+```
+Query: "Was ist eine Catalog Node ID?"
+
+Highly relevant docs: magnitude ~15.30 - 15.98 → score 0.90-0.97
+Moderately relevant:  magnitude ~17.00 - 19.00 → score 0.60-0.80
+Weakly relevant:      magnitude ~20.00 - 24.00 → score 0.10-0.50
+Irrelevant:           magnitude ~25.00 - 28.00 → score 0.00 (clamped)
+```
+
+## Alternatives
+
+### 1. Use sentence-transformers (Recommended for Production)
+
+```python
+from sentence_transformers import CrossEncoder
+
+model = CrossEncoder('BAAI/bge-reranker-v2-m3', device='cpu')
+scores = model.predict([(query, doc) for doc in documents])
+```
+
+**Pros:** Accurate, reliable, proper implementation  
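+**Cons:** Extra VRAM/RAM for a second copy of the model (on the order of 1-2 GB for BGE-reranker-v2-m3, depending on precision), separate from Ollama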
+
+### 2. Request Ollama Feature
+
+Open an issue on [Ollama's GitHub](https://github.com/ollama/ollama) requesting native `/api/rerank` support.
+
+### 3. Use API Services
+
+Services like Cohere, Jina AI, or Voyage AI offer reranking APIs.
+
+## Requirements
+
+```
+fastapi>=0.104.0
+uvicorn>=0.24.0
+httpx>=0.25.0
+pydantic>=2.0.0
+numpy>=1.24.0
+```
+
+## Contributing
+
+This is a workaround for a missing feature. Contributions welcome for:
+- Calibration configs for additional models
+- Auto-calibration logic
+- Alternative prompt formats
+- Better normalization strategies
+
+But remember: **The best contribution would be native Ollama support.**
+
+## License
+
+MIT
+
+## Disclaimer
+
+This is an **experimental workaround** that exploits undocumented behavior. It is:
+- Not endorsed by Ollama or BAAI
+- Not guaranteed to work across models or versions
+- Not suitable for production use without extensive testing
+- A temporary solution until native reranking support exists
+
+**Use at your own risk and always validate results against ground truth.**
--- a/plugins/reranking-endpoint/api.py
+++ b/plugins/reranking-endpoint/api.py
@@ -0,0 +1,186 @@
+import asyncio
+import httpx
+import logging
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+from typing import List, Optional
+import numpy as np
+
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+
+app = FastAPI(title="Ollama BGE Reranker (Working Workaround)")
+
+class RerankRequest(BaseModel):
+    model: str
+    query: str
+    documents: List[str]
+    top_n: Optional[int] = 3
+
+class RerankResult(BaseModel):
+    index: int
+    relevance_score: float
+    document: Optional[str] = None
+
+class RerankResponse(BaseModel):
+    results: List[RerankResult]
+
+async def get_embedding(
+    client: httpx.AsyncClient,
+    model: str,
+    text: str
+) -> Optional[List[float]]:
+    """Get embedding from Ollama."""
+    url = "http://localhost:11434/api/embeddings"
+    
+    try:
+        response = await client.post(
+            url,
+            json={"model": model, "prompt": text},
+            timeout=30.0
+        )
+        response.raise_for_status()
+        return response.json().get("embedding")
+    except Exception as e:
+        logger.error(f"Error getting embedding: {e}")
+        return None
+
+async def score_document_cross_encoder_workaround(
+    client: httpx.AsyncClient,
+    model: str,
+    query: str,
+    doc: str,
+    index: int
+) -> dict:
+    """
+    Workaround for using BGE-reranker with Ollama.
+    Based on: https://medium.com/@rosgluk/reranking-documents-with-ollama-and-qwen3-reranker-model-in-go-6dc9c2fb5f0b
+    
+    Key discovery: When using concatenated query+doc embeddings,
+    LOWER magnitude = MORE relevant. We invert the scores so that
+    higher values = more relevant (standard convention).
+    
+    Steps:
+    1. Concatenate query and document in cross-encoder format
+    2. Get embedding of the concatenated text
+    3. Calculate magnitude (lower = more relevant)
+    4. Invert and normalize to 0-1 (higher = more relevant)
+    """
+    
+    # Format as cross-encoder input
+    # The format matters - reranker models expect specific patterns
+    combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"
+    
+    # Get embedding
+    embedding = await get_embedding(client, model, combined)
+    
+    if embedding is None:
+        logger.warning(f"Failed to get embedding for document {index}")
+        return {
+            "index": index,
+            "relevance_score": 0.0,
+            "document": doc
+        }
+    
+    # Calculate magnitude (L2 norm) of the embedding vector
+    vec = np.array(embedding)
+    magnitude = float(np.linalg.norm(vec))
+    
+    # CRITICAL DISCOVERY: For BGE-reranker via Ollama embeddings:
+    # LOWER magnitude = MORE relevant document
+    # Observed range: ~15-28 (lower = better); anything above 25 clamps to 0
+    
+    # Invert and normalize to 0-1 where higher score = more relevant
+    # Adjusted bounds based on empirical observations
+    typical_good_magnitude = 15.0   # Highly relevant documents
+    typical_poor_magnitude = 25.0   # Irrelevant documents
+    
+    # Linear interpolation (inverted)
+    # magnitude 15 → score 1.0
+    # magnitude 25 → score 0.0
+    score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)
+    
+    # Clamp to 0-1 range
+    score = min(max(score, 0.0), 1.0)
+    
+    logger.debug(f"Doc {index}: magnitude={magnitude:.2f}, score={score:.4f}")
+    logger.info(f"Raw magnitude: {magnitude:.2f}")
+
+    return {
+        "index": index,
+        "relevance_score": score,
+        "document": doc
+    }
+
+@app.on_event("startup")
+async def check_ollama():
+    """Verify Ollama is accessible on startup."""
+    try:
+        async with httpx.AsyncClient() as client:
+            response = await client.get("http://localhost:11434/api/tags", timeout=5.0)
+            response.raise_for_status()
+            logger.info("✓ Successfully connected to Ollama")
+            logger.warning("⚠️  Using workaround: concatenation + magnitude")
+            logger.warning("⚠️  This is less accurate than proper cross-encoder usage")
+    except Exception as e:
+        logger.error(f"✗ Cannot connect to Ollama: {e}")
+
+@app.post("/v1/rerank", response_model=RerankResponse)
+async def rerank(request: RerankRequest):
+    """
+    Rerank documents using BGE-reranker via Ollama workaround.
+    
+    NOTE: This uses a workaround (magnitude of concatenated embeddings)
+    because Ollama doesn't expose BGE's classification head.
+    For best accuracy, use sentence-transformers directly.
+    """
+    if not request.documents:
+        raise HTTPException(status_code=400, detail="No documents provided")
+    
+    logger.info(f"Reranking {len(request.documents)} documents (workaround method)")
+    logger.info(f"Query: {request.query[:100]}...")
+    
+    async with httpx.AsyncClient() as client:
+        # Score all documents concurrently
+        tasks = [
+            score_document_cross_encoder_workaround(
+                client, request.model, request.query, doc, i
+            )
+            for i, doc in enumerate(request.documents)
+        ]
+        results = await asyncio.gather(*tasks)
+        
+        # Sort by score DESCENDING (higher score = more relevant)
+        # Scores are now inverted, so higher = better
+        results.sort(key=lambda x: x["relevance_score"], reverse=True)
+        
+        # Log scores
+        top_scores = [f"{r['relevance_score']:.4f}" for r in results[:request.top_n]]
+        logger.info(f"Top {len(top_scores)} scores: {top_scores}")
+        
+        return {"results": results[:request.top_n]}
+
+@app.get("/health")
+def health_check():
+    """Health check endpoint."""
+    return {
+        "status": "healthy",
+        "service": "ollama-bge-reranker-workaround",
+        "note": "Using magnitude workaround - less accurate than native"
+    }
+
+if __name__ == "__main__":
+    import uvicorn
+    
+    logger.info("=" * 60)
+    logger.info("Ollama BGE Reranker - WORKAROUND Implementation")
+    logger.info("=" * 60)
+    logger.info("Using concatenation + magnitude method")
+    logger.info("This works but is less accurate than proper cross-encoders")
+    logger.info("Starting on: http://0.0.0.0:8080")
+    logger.info("=" * 60)
+    
+    uvicorn.run(app, host="0.0.0.0", port=8080, log_level="info")
--- a/plugins/reranking-endpoint/requirements.txt
+++ b/plugins/reranking-endpoint/requirements.txt
@@ -0,0 +1,5 @@
+fastapi
+uvicorn
+httpx
+pydantic
+numpy