add rerank endpoint

2026-01-20 20:44:48 +00:00
parent ccbe95ac1e
commit 6c7f96145b
3 changed files with 497 additions and 0 deletions
--- a/plugins/reranking-endpoint/README.md
+++ b/plugins/reranking-endpoint/README.md
@@ -0,0 +1,306 @@
 # Ollama Reranker Workaround
 > **⚠️ Important:** This is a **workaround/hack**, not a proper solution. It exploits an undocumented behavior of embedding magnitudes and should be used with caution.
 A FastAPI service that provides document reranking using Ollama's embedding endpoint. This exists because Ollama does not natively support a `/api/rerank` endpoint for cross-encoder reranker models.
 ## The Problem
 Cross-encoder reranker models (like BGE-reranker-v2-m3) are designed to score query-document pairs for relevance. However:
 - **Ollama has no `/api/rerank` endpoint** - reranker models can't be used as intended
 - **`/api/embeddings`** - returns embeddings, not classification scores
 - **`/api/generate`** - reranker models can't generate text (they output uniform scores like 0.5)
 ## The Workaround
 This service uses a magnitude-based approach:
 1. Concatenates query and document in cross-encoder format: `"Query: {query}\n\nDocument: {doc}\n\nRelevance:"`
 2. Gets embedding vector from Ollama's `/api/embeddings` endpoint
 3. Calculates the L2 norm (magnitude) of the embedding vector
 4. **Key discovery:** For BGE-reranker-v2-m3, **lower magnitude = more relevant**
 5. Inverts and normalizes to 0-1 range where higher score = more relevant
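The five steps above can be sketched as one small function. This is only an illustration: `score_pair`, `fake_embed`, and the `embed` parameter are hypothetical names, `embed` stands in for whatever returns the vector from `/api/embeddings`, and the 15/25 defaults are the BGE-reranker-v2-m3 bounds quoted in this README.

```python
import math
from typing import Callable, List, Sequence


def score_pair(
    query: str,
    doc: str,
    embed: Callable[[str], Sequence[float]],
    good: float = 15.0,  # magnitude seen for highly relevant pairs (BGE-specific)
    poor: float = 25.0,  # magnitude seen for irrelevant pairs (BGE-specific)
) -> float:
    """Return a 0-1 relevance score derived from the embedding magnitude."""
    # Step 1: cross-encoder style concatenation
    combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"
    # Step 2: embedding of the concatenated text (the caller supplies the backend)
    vector = embed(combined)
    # Step 3: L2 norm of the embedding
    magnitude = math.sqrt(sum(x * x for x in vector))
    # Steps 4-5: lower magnitude = more relevant, so invert and clamp to [0, 1]
    score = (poor - magnitude) / (poor - good)
    return min(max(score, 0.0), 1.0)


# Stub backend for illustration only: a 4-dimensional vector with norm 20
def fake_embed(text: str) -> List[float]:
    return [10.0, 10.0, 10.0, 10.0]


print(score_pair("any query", "any document", fake_embed))  # 0.5
```

Passing the embedding function in as an argument keeps the scoring logic testable without a running Ollama instance.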
 ### Why This Works (Sort Of)
 When a cross-encoder model processes a query-document pair through the embedding endpoint, the embedding's magnitude appears to correlate with relevance for some models. This is:
 - **Not documented behavior**
 - **Not guaranteed across models**
 - **Not the intended use of the embedding endpoint**
 - **Less accurate than proper cross-encoder scoring**
 But it's the only way to use reranker models with Ollama right now.
 ## Limitations
 ### ⚠️ Critical Limitations
 1. **Model-Specific Behavior**
   - Magnitude ranges differ between models (BGE: 15-28, others: unknown)
   - Correlation direction may vary (lower/higher = more relevant)
   - Requires manual calibration per model
 2. **No Theoretical Foundation**
   - Exploits accidental behavior, not designed functionality
   - Could break with model updates
   - No guarantee of correctness
 3. **Less Accurate Than Proper Methods**
   - Native cross-encoder scoring is more accurate
   - sentence-transformers library is the gold standard
   - This is a compromise for GPU scheduling benefits
 4. **Embedding Dimension Dependency**
   - Magnitude scales with dimensionality (384 vs 768 vs 1024)
   - Models with different dimensions need different calibration
 5. **Performance**
   - Requires one API call per document (40 docs = 40 calls)
   - Slower than native reranking would be
   - Calls run concurrently, so latency is usable but not optimal
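The dimension dependency in limitation 4 can be seen with a toy calculation. The sketch below assumes every component has the same size, which real embeddings need not satisfy; under that assumption the L2 norm grows with the square root of the dimension. `l2_norm` is a hypothetical helper name.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


# Same per-component value, different dimensionality (values are illustrative)
for dim in (384, 768, 1024):
    vector = [0.5] * dim
    print(dim, round(l2_norm(vector), 2))
```

Bounds calibrated on a 1024-dimensional model would therefore be the wrong scale for a 384-dimensional one, even if both showed the same correlation direction.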
 ## When To Use This
 ✅ **Use if:**
 - You need Ollama's GPU scheduling for multiple models
 - VRAM is constrained and you can't run separate services
 - You're okay with reduced accuracy vs sentence-transformers
 - You can tolerate model-specific calibration
 ❌ **Don't use if:**
 - You need reliable, production-grade reranking
 - You need cross-model consistency
 - You have VRAM for sentence-transformers (~200MB for reranker only)
 - Accuracy is critical
 ## Installation
 ```bash
 # Clone the repository
 git clone https://github.com/yourusername/ollama-reranker-workaround.git
 cd ollama-reranker-workaround
 # Create virtual environment
 python3 -m venv .venv
 source .venv/bin/activate  # On Windows: .venv\Scripts\activate
 # Install dependencies
 pip install -r requirements.txt
 # Ensure Ollama is running with a reranker model
 ollama pull qllama/bge-reranker-v2-m3
 ```
 ## Usage
 ### Start the Service
 ```bash
 python api.py
 ```
 The service listens on all interfaces at port 8080 (`http://0.0.0.0:8080`); from the same machine, reach it at `http://localhost:8080`.
 ### API Request
 ```bash
 curl -X POST http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qllama/bge-reranker-v2-m3:latest",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence.",
      "The weather today is sunny.",
      "Neural networks are used in deep learning."
    ],
    "top_n": 2
  }'
 ```
 ### Response
 ```json
 {
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9234,
      "document": "Machine learning is a subset of artificial intelligence."
    },
    {
      "index": 2,
      "relevance_score": 0.7845,
      "document": "Neural networks are used in deep learning."
    }
  ]
 }
 ```
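The same request can be made from Python. The following is a minimal client sketch using only the standard library; the function names are hypothetical, and the base URL and model name are the ones used in the curl example.

```python
import json
import urllib.request


def build_rerank_request(query, documents, top_n=2,
                         model="qllama/bge-reranker-v2-m3:latest",
                         base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for /v1/rerank."""
    payload = {"model": model, "query": query,
               "documents": documents, "top_n": top_n}
    return urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_rerank_response(body):
    """Return (index, relevance_score) pairs from a response body."""
    return [(r["index"], r["relevance_score"])
            for r in json.loads(body)["results"]]


def rerank(query, documents, top_n=2):
    """Send the request; requires the service to be running."""
    request = build_rerank_request(query, documents, top_n)
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_rerank_response(response.read())
```

The `index` in each result refers to the position of the document in the submitted list, so callers can map scores back to their own records.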
 ## Configuration & Tunables
 ### Model Calibration
 The most critical parameters are in `score_document_cross_encoder_workaround()`:
 ```python
 # Magnitude bounds (model-specific!)
 typical_good_magnitude = 15.0   # Highly relevant documents
 typical_poor_magnitude = 25.0   # Irrelevant documents
 # For BGE-reranker-v2-m3, observed range is ~15-28
 # Lower magnitude = more relevant (inverted correlation)
 ```
 ### How to Calibrate for a New Model
 1. **Enable magnitude logging:**
   ```python
   logger.info(f"Raw magnitude: {magnitude:.2f}")
   ```
 2. **Test with known relevant/irrelevant documents:**
   ```python
   # Send queries with obviously relevant and irrelevant docs
   # Observe magnitude ranges in logs
   ```
 3. **Determine correlation direction:**
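   - If relevant docs have **lower** magnitudes → keep `typical_good_magnitude` below `typical_poor_magnitude` (the BGE-reranker-v2-m3 case)
   - If relevant docs have **higher** magnitudes → set `typical_good_magnitude` above `typical_poor_magnitude`; the normalization formula then works unchanged, because its numerator and denominator both change sign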
 4. **Set bounds:**
   ```python
   # Find 90th percentile of relevant doc magnitudes
   typical_good_magnitude = <observed_value>
   # Find 10th percentile of irrelevant doc magnitudes  
   typical_poor_magnitude = <observed_value>
   ```
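The calibration steps could be automated along these lines. This is a sketch, not part of `api.py`: `percentile` and `calibrate` are hypothetical helpers, the direction test by comparing medians is an assumption, and so is mirroring the percentiles when higher magnitude means more relevant. The sample magnitudes are made up to resemble the ranges quoted in this README.

```python
import statistics


def percentile(values, pct):
    """Linear-interpolation percentile for pct in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def calibrate(relevant, irrelevant):
    """Suggest a direction and bounds from logged magnitudes."""
    lower_is_relevant = statistics.median(relevant) < statistics.median(irrelevant)
    if lower_is_relevant:
        good = percentile(relevant, 90)
        poor = percentile(irrelevant, 10)
    else:
        # Assumption: mirror the percentiles for the opposite direction
        good = percentile(relevant, 10)
        poor = percentile(irrelevant, 90)
    return {
        "lower_is_relevant": lower_is_relevant,
        "typical_good_magnitude": good,
        "typical_poor_magnitude": poor,
    }


# Made-up magnitudes for illustration only
relevant = [15.3, 15.6, 15.98, 16.4]
irrelevant = [25.0, 26.1, 27.2, 28.0]
print(calibrate(relevant, irrelevant))
```

With only a handful of samples the percentiles are unstable, so collect magnitudes from many queries before trusting the suggested bounds.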
 ### Prompt Format Tuning
 The concatenation format may affect results:
 ```python
 # Current format (works for BGE-reranker-v2-m3)
 combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"
 # Alternative formats to try:
 combined = f"{query} [SEP] {doc}"
 combined = f"query: {query} document: {doc}"
 combined = f"<query>{query}</query><document>{doc}</document>"
 ```
 Test different formats and check if score distributions improve.
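One way to compare formats is to log magnitudes for known relevant and irrelevant documents under each format and compute how far apart the two groups sit. The sketch below uses a simple separation measure (mean gap divided by the root of the average variance); that measure is an assumption of this example, not something the service computes, and the magnitudes are made up.

```python
import math
import statistics


def separation(relevant, irrelevant):
    """Mean gap divided by the root of the average variance of the two groups."""
    gap = abs(statistics.mean(relevant) - statistics.mean(irrelevant))
    spread = math.sqrt(
        (statistics.pvariance(relevant) + statistics.pvariance(irrelevant)) / 2
    )
    return gap / spread if spread > 0 else float("inf")


# Made-up magnitudes per prompt format, for illustration only:
# (magnitudes of relevant docs, magnitudes of irrelevant docs)
logged = {
    "Query:/Document:/Relevance:": ([15.0, 16.0], [25.0, 26.0]),
    "{query} [SEP] {doc}": ([18.0, 22.0], [21.0, 25.0]),
}

best = max(logged, key=lambda name: separation(*logged[name]))
print(best)
```

A format with a larger separation leaves more room between the good and poor bounds, which makes the normalized scores less sensitive to noise.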
 ### Concurrency Settings
 ```python
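 # In the rerank() endpoint
 # Process all documents concurrently (default)
 tasks = [
    score_document_cross_encoder_workaround(
        client, request.model, request.query, doc, i
    )
    for i, doc in enumerate(request.documents)
 ]
 results = await asyncio.gather(*tasks)
 # Or process in batches to limit concurrent requests:
 batch_size = 10
 results = []
 for start in range(0, len(request.documents), batch_size):
    batch = request.documents[start:start + batch_size]
    results += await asyncio.gather(*[
        score_document_cross_encoder_workaround(
            client, request.model, request.query, doc, start + offset
        )
        for offset, doc in enumerate(batch)
    ])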
 ```
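Fixed batches wait for the slowest call in each batch before starting the next one. A semaphore is one alternative way to cap in-flight requests while keeping the pipeline full. The sketch below is self-contained: `gather_limited` and `fake_score` are hypothetical names, and `fake_score` stands in for the real scoring coroutine.

```python
import asyncio


async def gather_limited(coroutine_factories, limit=10):
    """Run coroutines concurrently, at most `limit` at a time, keeping order."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run(f) for f in coroutine_factories))


# Stand-in for a scoring call; a real one would await the Ollama request
async def fake_score(index):
    await asyncio.sleep(0.01)
    return {"index": index, "relevance_score": 0.0}


results = asyncio.run(
    gather_limited([lambda i=i: fake_score(i) for i in range(25)], limit=10)
)
print(len(results))  # 25
```

`asyncio.gather` returns results in the order the awaitables were passed, so the indices still line up with the input documents.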
 ## Technical Details
 ### Magnitude Calculation
 ```python
 import numpy as np
 # Get embedding from Ollama
 embedding = await get_embedding(client, model, combined_text)
 # Calculate L2 norm (Euclidean length)
 vec = np.array(embedding)
 magnitude = float(np.linalg.norm(vec))
 # magnitude = sqrt(sum(x_i^2 for all dimensions))
 ```
 ### Score Normalization
 ```python
 # Linear interpolation (inverted for BGE-reranker-v2-m3)
 score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)
 # Clamp to [0, 1]
 score = min(max(score, 0.0), 1.0)
 ```
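A few worked values make the mapping concrete. The sketch below wraps the two lines above in a hypothetical `normalize` helper, using the BGE-reranker-v2-m3 bounds from this README as defaults.

```python
def normalize(magnitude, good=15.0, poor=25.0):
    """Map a magnitude to [0, 1]; `good` maps to 1.0 and `poor` to 0.0."""
    score = (poor - magnitude) / (poor - good)
    return min(max(score, 0.0), 1.0)


for magnitude in (12.0, 15.0, 17.5, 20.0, 25.0, 28.0):
    print(magnitude, normalize(magnitude))
# 12.0 and 15.0 give 1.0, 17.5 gives 0.75, 20.0 gives 0.5,
# 25.0 and 28.0 give 0.0 (values outside the bounds are clamped)
```

Because numerator and denominator change sign together, the same function also handles a model where higher magnitude means more relevant: `normalize(25.0, good=25.0, poor=15.0)` returns `1.0`.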
 ### Example Magnitude Distributions
 From real queries to BGE-reranker-v2-m3:
 ```
 Query: "Was ist eine Catalog Node ID?"
 Highly relevant docs: magnitude ~15.30 - 15.98 → score 0.90-0.97
 Moderately relevant:   magnitude ~17.00 - 19.00 → score 0.60-0.80
 Weakly relevant:       magnitude ~20.00 - 24.00 → score 0.10-0.50
 Irrelevant:           magnitude ~25.00 - 28.00 → score 0.00
 ```
 ## Alternatives
 ### 1. Use sentence-transformers (Recommended for Production)
 ```python
 from sentence_transformers import CrossEncoder
 model = CrossEncoder('BAAI/bge-reranker-v2-m3', device='cpu')
 scores = model.predict([(query, doc) for doc in documents])
 ```
 **Pros:** Accurate, reliable, proper implementation  
 **Cons:** ~200MB VRAM/RAM, separate from Ollama
 ### 2. Request Ollama Feature
 Open an issue on [Ollama's GitHub](https://github.com/ollama/ollama) requesting native `/api/rerank` support.
 ### 3. Use API Services
 Services like Cohere, Jina AI, or Voyage AI offer reranking APIs.
 ## Requirements
 ```
 fastapi>=0.104.0
 uvicorn>=0.24.0
 httpx>=0.25.0
 pydantic>=2.0.0
 numpy>=1.24.0
 ```
 ## Contributing
 This is a workaround for a missing feature. Contributions welcome for:
 - Calibration configs for additional models
 - Auto-calibration logic
 - Alternative prompt formats
 - Better normalization strategies
 But remember: **The best contribution would be native Ollama support.**
 ## License
 MIT
 ## Disclaimer
 This is an **experimental workaround** that exploits undocumented behavior. It is:
 - Not endorsed by Ollama or BAAI
 - Not guaranteed to work across models or versions
 - Not suitable for production use without extensive testing
 - A temporary solution until native reranking support exists
 **Use at your own risk and always validate results against ground truth.**
--- a/plugins/reranking-endpoint/api.py
+++ b/plugins/reranking-endpoint/api.py
@@ -0,0 +1,186 @@
 import asyncio
 import httpx
 import logging
 from fastapi import FastAPI, HTTPException
 from pydantic import BaseModel
 from typing import List, Optional
 import numpy as np
 logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
 )
 logger = logging.getLogger(__name__)
 app = FastAPI(title="Ollama BGE Reranker (Working Workaround)")
 class RerankRequest(BaseModel):
    model: str
    query: str
    documents: List[str]
    top_n: Optional[int] = 3
 class RerankResult(BaseModel):
    index: int
    relevance_score: float
    document: Optional[str] = None
 class RerankResponse(BaseModel):
    results: List[RerankResult]
 async def get_embedding(
    client: httpx.AsyncClient,
    model: str,
    text: str
 ) -> Optional[List[float]]:
    """Get embedding from Ollama."""
    url = "http://localhost:11434/api/embeddings"
    try:
        response = await client.post(
            url,
            json={"model": model, "prompt": text},
            timeout=30.0
        )
        response.raise_for_status()
        return response.json().get("embedding")
    except Exception as e:
        logger.error(f"Error getting embedding: {e}")
        return None
 async def score_document_cross_encoder_workaround(
    client: httpx.AsyncClient,
    model: str,
    query: str,
    doc: str,
    index: int
 ) -> dict:
    """
    Workaround for using BGE-reranker with Ollama.
    Based on: https://medium.com/@rosgluk/reranking-documents-with-ollama-and-qwen3-reranker-model-in-go-6dc9c2fb5f0b
    Key discovery: When using concatenated query+doc embeddings,
    LOWER magnitude = MORE relevant. We invert the scores so that
    higher values = more relevant (standard convention).
    Steps:
    1. Concatenate query and document in cross-encoder format
    2. Get embedding of the concatenated text
    3. Calculate magnitude (lower = more relevant)
    4. Invert and normalize to 0-1 (higher = more relevant)
    """
    # Format as cross-encoder input
    # The format matters - reranker models expect specific patterns
    combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"
    # Get embedding
    embedding = await get_embedding(client, model, combined)
    if embedding is None:
        logger.warning(f"Failed to get embedding for document {index}")
        return {
            "index": index,
            "relevance_score": 0.0,
            "document": doc
        }
    # Calculate magnitude (L2 norm) of the embedding vector
    vec = np.array(embedding)
    magnitude = float(np.linalg.norm(vec))
    # CRITICAL DISCOVERY: For BGE-reranker via Ollama embeddings:
    # LOWER magnitude = MORE relevant document
    # Observed range: ~15-28 (lower = better); magnitudes above 25 clamp to 0
    # Invert and normalize to 0-1 where higher score = more relevant
    # Adjusted bounds based on empirical observations
    typical_good_magnitude = 15.0   # Highly relevant documents
    typical_poor_magnitude = 25.0   # Irrelevant documents
    # Linear interpolation (inverted)
    # magnitude 15 → score 1.0
    # magnitude 25 → score 0.0
    score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)
    # Clamp to 0-1 range
    score = min(max(score, 0.0), 1.0)
    logger.debug(f"Doc {index}: magnitude={magnitude:.2f}, score={score:.4f}")
    logger.info(f"Raw magnitude: {magnitude:.2f}")
    return {
        "index": index,
        "relevance_score": score,
        "document": doc
    }
 @app.on_event("startup")
 async def check_ollama():
    """Verify Ollama is accessible on startup."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get("http://localhost:11434/api/tags", timeout=5.0)
            response.raise_for_status()
            logger.info("✓ Successfully connected to Ollama")
            logger.warning("⚠️  Using workaround: concatenation + magnitude")
            logger.warning("⚠️  This is less accurate than proper cross-encoder usage")
    except Exception as e:
        logger.error(f"✗ Cannot connect to Ollama: {e}")
 @app.post("/v1/rerank", response_model=RerankResponse)
 async def rerank(request: RerankRequest):
    """
    Rerank documents using BGE-reranker via Ollama workaround.
    NOTE: This uses a workaround (magnitude of concatenated embeddings)
    because Ollama doesn't expose BGE's classification head.
    For best accuracy, use sentence-transformers directly.
    """
    if not request.documents:
        raise HTTPException(status_code=400, detail="No documents provided")
    logger.info(f"Reranking {len(request.documents)} documents (workaround method)")
    logger.info(f"Query: {request.query[:100]}...")
    async with httpx.AsyncClient() as client:
        # Score all documents concurrently
        tasks = [
            score_document_cross_encoder_workaround(
                client, request.model, request.query, doc, i
            )
            for i, doc in enumerate(request.documents)
        ]
        results = await asyncio.gather(*tasks)
        # Sort by score DESCENDING (higher score = more relevant)
        # Scores are now inverted, so higher = better
        results.sort(key=lambda x: x["relevance_score"], reverse=True)
        # Log scores
        top_scores = [f"{r['relevance_score']:.4f}" for r in results[:request.top_n]]
        logger.info(f"Top {len(top_scores)} scores: {top_scores}")
        return {"results": results[:request.top_n]}
 @app.get("/health")
 def health_check():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "service": "ollama-bge-reranker-workaround",
        "note": "Using magnitude workaround - less accurate than native"
    }
 if __name__ == "__main__":
    import uvicorn
    logger.info("=" * 60)
    logger.info("Ollama BGE Reranker - WORKAROUND Implementation")
    logger.info("=" * 60)
    logger.info("Using concatenation + magnitude method")
    logger.info("This works but is less accurate than proper cross-encoders")
    logger.info("Starting on: http://0.0.0.0:8080")
    logger.info("=" * 60)
    uvicorn.run(app, host="0.0.0.0", port=8080, log_level="info")
--- a/plugins/reranking-endpoint/requirements.txt
+++ b/plugins/reranking-endpoint/requirements.txt
@@ -0,0 +1,5 @@
 fastapi
 uvicorn
 httpx
 pydantic
 numpy