diff --git a/plugins/reranking-endpoint/README.md b/plugins/reranking-endpoint/README.md
new file mode 100644
index 0000000..672c1a9
--- /dev/null
+++ b/plugins/reranking-endpoint/README.md
@@ -0,0 +1,306 @@
# Ollama Reranker Workaround

> **⚠️ Important:** This is a **workaround/hack**, not a proper solution. It exploits an undocumented behavior of embedding magnitudes and should be used with caution.

A FastAPI service that provides document reranking through Ollama's embedding endpoint. It exists because Ollama does not natively support a `/api/rerank` endpoint for cross-encoder reranker models.

## The Problem

Cross-encoder reranker models (like BGE-reranker-v2-m3) are designed to score query-document pairs for relevance. However:

- **Ollama has no `/api/rerank` endpoint** - reranker models can't be used as intended
- **`/api/embeddings`** - returns embeddings, not classification scores
- **`/api/generate`** - reranker models can't generate text (they output uniform scores like 0.5)

## The Workaround

This service uses a magnitude-based approach:

1. Concatenates the query and document in cross-encoder format: `"Query: {query}\n\nDocument: {doc}\n\nRelevance:"`
2. Gets an embedding vector from Ollama's `/api/embeddings` endpoint
3. Calculates the L2 norm (magnitude) of the embedding vector
4. **Key discovery:** For BGE-reranker-v2-m3, **lower magnitude = more relevant**
5. Inverts and normalizes to a 0-1 range where a higher score = more relevant

### Why This Works (Sort Of)

When a cross-encoder model processes a query-document pair through the embedding endpoint, the embedding's magnitude appears to correlate with relevance for some models. This is:

- **Not documented behavior**
- **Not guaranteed across models**
- **Not the intended use of the embedding endpoint**
- **Less accurate than proper cross-encoder scoring**

But it is the only way to use reranker models with Ollama right now.

## Limitations

### ⚠️ Critical Limitations

1. **Model-Specific Behavior**
   - Magnitude ranges differ between models (BGE: ~15-28; others: unknown)
   - The correlation direction may vary (lower or higher = more relevant)
   - Requires manual calibration per model

2. **No Theoretical Foundation**
   - Exploits accidental behavior, not designed functionality
   - Could break with model updates
   - No guarantee of correctness

3. **Less Accurate Than Proper Methods**
   - Native cross-encoder scoring is more accurate
   - The sentence-transformers library is the gold standard
   - This is a compromise made for GPU scheduling benefits

4. **Embedding Dimension Dependency**
   - Magnitude scales with dimensionality (384 vs 768 vs 1024) - see the sketch after this list
   - Models with different dimensions need different calibration

5. **Performance**
   - Requires one API call per document (40 docs = 40 calls)
   - Slower than native batch reranking would be
   - Concurrent requests keep latency tolerable, but it is still not optimal
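Limitation 4 is easy to see numerically: for vectors with i.i.d. components, the expected L2 norm grows roughly with the square root of the dimension. A quick illustration (not part of the service; random vectors stand in for embeddings):

```python
import numpy as np

# Random vectors as a stand-in for embeddings: the mean L2 norm grows with
# sqrt(dimension), so magnitude bounds calibrated at one embedding size
# do not transfer to another.
rng = np.random.default_rng(seed=0)
for dim in (384, 768, 1024):
    norms = np.linalg.norm(rng.standard_normal((1000, dim)), axis=1)
    print(f"dim={dim:4d}  mean L2 norm ≈ {norms.mean():6.1f}  (sqrt(dim) ≈ {dim ** 0.5:.1f})")
```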
## When To Use This

✅ **Use if:**

- You need Ollama's GPU scheduling for multiple models
- VRAM is constrained and you can't run separate services
- You're okay with reduced accuracy vs sentence-transformers
- You can tolerate model-specific calibration

❌ **Don't use if:**

- You need reliable, production-grade reranking
- You need cross-model consistency
- You have VRAM for sentence-transformers (~200MB for the reranker alone)
- Accuracy is critical

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/ollama-reranker-workaround.git
cd ollama-reranker-workaround

# Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Ensure Ollama is running with a reranker model
ollama pull qllama/bge-reranker-v2-m3
```

## Usage

### Start the Service

```bash
python api.py
```

The service listens on `http://0.0.0.0:8080`.

### API Request

```bash
curl -X POST http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qllama/bge-reranker-v2-m3:latest",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence.",
      "The weather today is sunny.",
      "Neural networks are used in deep learning."
    ],
    "top_n": 2
  }'
```

### Response

```json
{
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9234,
      "document": "Machine learning is a subset of artificial intelligence."
    },
    {
      "index": 2,
      "relevance_score": 0.7845,
      "document": "Neural networks are used in deep learning."
    }
  ]
}
```

## Configuration & Tunables

### Model Calibration

The most critical parameters live in `score_document_cross_encoder_workaround()`:

```python
# Magnitude bounds (model-specific!)
typical_good_magnitude = 15.0  # Highly relevant documents
typical_poor_magnitude = 25.0  # Irrelevant documents

# For BGE-reranker-v2-m3, the observed range is ~15-28
# Lower magnitude = more relevant (inverted correlation)
```

### How to Calibrate for a New Model

1. **Enable magnitude logging:**

   ```python
   logger.info(f"Raw magnitude: {magnitude:.2f}")
   ```

2. **Test with known relevant/irrelevant documents:**

   ```python
   # Send queries with obviously relevant and irrelevant docs
   # Observe the magnitude ranges in the logs
   ```

3. **Determine the correlation direction:**
   - If relevant docs have **lower** magnitudes → keep the inverted interpolation used in `score_document_cross_encoder_workaround()`
   - If relevant docs have **higher** magnitudes → flip the interpolation so the score rises with magnitude

4. **Set the bounds** (the values below are examples - derive yours from the logged magnitudes):

   ```python
   # ~90th percentile of relevant-doc magnitudes, e.g.:
   typical_good_magnitude = 16.0

   # ~10th percentile of irrelevant-doc magnitudes, e.g.:
   typical_poor_magnitude = 24.0
   ```

### Prompt Format Tuning

The concatenation format may affect results:

```python
# Current format (works for BGE-reranker-v2-m3)
combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"

# Alternative formats to try:
combined = f"{query} [SEP] {doc}"
combined = f"query: {query} document: {doc}"
combined = f"{query}{doc}"
```

Test different formats and check whether the score distributions improve.

### Concurrency Settings

```python
# In the rerank() endpoint
# Process all documents concurrently (default)
tasks = [score_document_cross_encoder_workaround(...) for doc in documents]
results = await asyncio.gather(*tasks)

# Or batch for rate limiting:
batch_size = 10
for i in range(0, len(documents), batch_size):
    batch = documents[i:i+batch_size]
    # process the batch...
```
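A middle ground between full concurrency and fixed batches is to cap the number of in-flight requests with a semaphore. A minimal sketch, reusing `score_document_cross_encoder_workaround()` from `api.py`; the `MAX_IN_FLIGHT` value is illustrative:

```python
import asyncio

import httpx

from api import score_document_cross_encoder_workaround  # coroutine defined in api.py

# Illustrative cap - tune to what your Ollama instance tolerates
MAX_IN_FLIGHT = 8

async def score_all(client: httpx.AsyncClient, model: str,
                    query: str, documents: list[str]) -> list[dict]:
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded_score(doc: str, index: int) -> dict:
        # At most MAX_IN_FLIGHT embedding calls run concurrently
        async with semaphore:
            return await score_document_cross_encoder_workaround(
                client, model, query, doc, index
            )

    return await asyncio.gather(
        *(bounded_score(doc, i) for i, doc in enumerate(documents))
    )
```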
## Technical Details

### Magnitude Calculation

```python
import numpy as np

# Get the embedding from Ollama
embedding = await get_embedding(client, model, combined_text)

# Calculate the L2 norm (Euclidean length)
vec = np.array(embedding)
magnitude = float(np.linalg.norm(vec))
# magnitude = sqrt(sum(x_i^2 for all dimensions))
```

### Score Normalization

```python
# Linear interpolation (inverted for BGE-reranker-v2-m3)
score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)

# Clamp to [0, 1]
score = min(max(score, 0.0), 1.0)
```
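Folding the normalization into a small helper makes it easy to sanity-check against known magnitudes (illustrative; it mirrors, but is not part of, `api.py`):

```python
def magnitude_to_score(magnitude: float,
                       good: float = 15.0,
                       poor: float = 25.0) -> float:
    """Map a raw embedding magnitude to a 0-1 relevance score (inverted linear)."""
    score = (poor - magnitude) / (poor - good)
    return min(max(score, 0.0), 1.0)

# Spot checks against the observed ranges below:
assert magnitude_to_score(15.0) == 1.0  # at/below the "good" bound saturates at 1
assert magnitude_to_score(17.0) == 0.8  # moderately relevant
assert magnitude_to_score(26.0) == 0.0  # past the "poor" bound clamps to 0
```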
### Example Magnitude Distributions

From real queries to BGE-reranker-v2-m3 (scores computed with the 15/25 bounds above):

```
Query: "What is a Catalog Node ID?"

Highly relevant docs:  magnitude ~15.30 - 15.98 → score 0.90 - 0.97
Moderately relevant:   magnitude ~17.00 - 19.00 → score 0.60 - 0.80
Weakly relevant:       magnitude ~20.00 - 24.00 → score 0.10 - 0.50
Irrelevant:            magnitude ~25.00 - 28.00 → score 0.00
```

## Alternatives

### 1. Use sentence-transformers (Recommended for Production)

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('BAAI/bge-reranker-v2-m3', device='cpu')
scores = model.predict([(query, doc) for doc in documents])
```

**Pros:** Accurate, reliable, a proper implementation
**Cons:** ~200MB VRAM/RAM, runs separately from Ollama

### 2. Request an Ollama Feature

Open an issue on [Ollama's GitHub](https://github.com/ollama/ollama) requesting native `/api/rerank` support.

### 3. Use API Services

Services like Cohere, Jina AI, or Voyage AI offer reranking APIs.

## Requirements

```
fastapi>=0.104.0
uvicorn>=0.24.0
httpx>=0.25.0
pydantic>=2.0.0
numpy>=1.24.0
```

## Contributing

This is a workaround for a missing feature. Contributions are welcome for:

- Calibration configs for additional models
- Auto-calibration logic
- Alternative prompt formats
- Better normalization strategies

But remember: **the best contribution would be native Ollama support.**

## License

MIT

## Disclaimer

This is an **experimental workaround** that exploits undocumented behavior. It is:

- Not endorsed by Ollama or BAAI
- Not guaranteed to work across models or versions
- Not suitable for production use without extensive testing
- A temporary solution until native reranking support exists

**Use at your own risk and always validate results against ground truth.**

diff --git a/plugins/reranking-endpoint/api.py b/plugins/reranking-endpoint/api.py
new file mode 100644
index 0000000..7eb7516
--- /dev/null
+++ b/plugins/reranking-endpoint/api.py
@@ -0,0 +1,186 @@
import asyncio
import logging
from typing import List, Optional

import httpx
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

app = FastAPI(title="Ollama BGE Reranker (Workaround)")

class RerankRequest(BaseModel):
    model: str
    query: str
    documents: List[str]
    top_n: Optional[int] = 3

class RerankResult(BaseModel):
    index: int
    relevance_score: float
    document: Optional[str] = None

class RerankResponse(BaseModel):
    results: List[RerankResult]

async def get_embedding(
    client: httpx.AsyncClient,
    model: str,
    text: str
) -> Optional[List[float]]:
    """Get an embedding from Ollama."""
    url = "http://localhost:11434/api/embeddings"

    try:
        response = await client.post(
            url,
            json={"model": model, "prompt": text},
            timeout=30.0
        )
        response.raise_for_status()
        return response.json().get("embedding")
    except Exception as e:
        logger.error(f"Error getting embedding: {e}")
        return None

async def score_document_cross_encoder_workaround(
    client: httpx.AsyncClient,
    model: str,
    query: str,
    doc: str,
    index: int
) -> dict:
    """
    Workaround for using BGE-reranker with Ollama.
    Based on: https://medium.com/@rosgluk/reranking-documents-with-ollama-and-qwen3-reranker-model-in-go-6dc9c2fb5f0b

    Key discovery: when using concatenated query+doc embeddings,
    LOWER magnitude = MORE relevant. We invert the scores so that
    higher values = more relevant (the standard convention).

    Steps:
    1. Concatenate the query and document in cross-encoder format
    2. Get the embedding of the concatenated text
    3. Calculate the magnitude (lower = more relevant)
    4. Invert and normalize to 0-1 (higher = more relevant)
    """
    # Format as cross-encoder input
    # The format matters - reranker models expect specific patterns
    combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"

    # Get the embedding
    embedding = await get_embedding(client, model, combined)

    if embedding is None:
        logger.warning(f"Failed to get embedding for document {index}")
        return {
            "index": index,
            "relevance_score": 0.0,
            "document": doc
        }

    # Calculate the magnitude (L2 norm) of the embedding vector
    vec = np.array(embedding)
    magnitude = float(np.linalg.norm(vec))

    # CRITICAL DISCOVERY: for BGE-reranker via Ollama embeddings,
    # LOWER magnitude = MORE relevant document.
    # Observed range: ~15-28 (lower = better)

    # Invert and normalize to 0-1 where a higher score = more relevant.
    # Bounds set from empirical observations (model-specific!)
    typical_good_magnitude = 15.0  # Highly relevant documents
    typical_poor_magnitude = 25.0  # Irrelevant documents

    # Linear interpolation (inverted):
    # magnitude 15 → score 1.0
    # magnitude 25 → score 0.0
    score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)

    # Clamp to the 0-1 range
    score = min(max(score, 0.0), 1.0)

    # The INFO-level magnitude log doubles as the calibration hook described in the README
    logger.info(f"Raw magnitude: {magnitude:.2f}")
    logger.debug(f"Doc {index}: magnitude={magnitude:.2f}, score={score:.4f}")

    return {
        "index": index,
        "relevance_score": score,
        "document": doc
    }

@app.on_event("startup")
async def check_ollama():
    """Verify that Ollama is accessible on startup."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get("http://localhost:11434/api/tags", timeout=5.0)
            response.raise_for_status()
        logger.info("✓ Successfully connected to Ollama")
        logger.warning("⚠️ Using workaround: concatenation + magnitude")
        logger.warning("⚠️ This is less accurate than proper cross-encoder usage")
    except Exception as e:
        logger.error(f"✗ Cannot connect to Ollama: {e}")

@app.post("/v1/rerank", response_model=RerankResponse)
async def rerank(request: RerankRequest):
    """
    Rerank documents using BGE-reranker via the Ollama workaround.

    NOTE: This uses a workaround (the magnitude of concatenated embeddings)
    because Ollama doesn't expose BGE's classification head.
    For best accuracy, use sentence-transformers directly.
    """
    if not request.documents:
        raise HTTPException(status_code=400, detail="No documents provided")

    logger.info(f"Reranking {len(request.documents)} documents (workaround method)")
    logger.info(f"Query: {request.query[:100]}...")

    async with httpx.AsyncClient() as client:
        # Score all documents concurrently
        tasks = [
            score_document_cross_encoder_workaround(
                client, request.model, request.query, doc, i
            )
            for i, doc in enumerate(request.documents)
        ]
        results = await asyncio.gather(*tasks)

    # Sort by score DESCENDING (higher score = more relevant;
    # the magnitudes were already inverted during scoring)
    results.sort(key=lambda x: x["relevance_score"], reverse=True)

    # Log the top scores
    top_scores = [f"{r['relevance_score']:.4f}" for r in results[:request.top_n]]
    logger.info(f"Top {len(top_scores)} scores: {top_scores}")

    return {"results": results[:request.top_n]}

@app.get("/health")
def health_check():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "service": "ollama-bge-reranker-workaround",
        "note": "Using the magnitude workaround - less accurate than native reranking"
    }

if __name__ == "__main__":
    import uvicorn

    logger.info("=" * 60)
    logger.info("Ollama BGE Reranker - WORKAROUND Implementation")
    logger.info("=" * 60)
    logger.info("Using the concatenation + magnitude method")
    logger.info("This works, but is less accurate than a proper cross-encoder")
    logger.info("Starting on: http://0.0.0.0:8080")
    logger.info("=" * 60)

    uvicorn.run(app, host="0.0.0.0", port=8080, log_level="info")

diff --git a/plugins/reranking-endpoint/requirements.txt b/plugins/reranking-endpoint/requirements.txt
new file mode 100644
index 0000000..bf8f498
--- /dev/null
+++ b/plugins/reranking-endpoint/requirements.txt
@@ -0,0 +1,5 @@
fastapi
uvicorn
httpx
pydantic
numpy