add rerank endpoint

plugins/reranking-endpoint/README.md
@@ -0,0 +1,306 @@
# Ollama Reranker Workaround

> **⚠️ Important:** This is a **workaround/hack**, not a proper solution. It exploits an undocumented behavior of embedding magnitudes and should be used with caution.

A FastAPI service that provides document reranking using Ollama's embedding endpoint. This exists because Ollama does not natively support a `/api/rerank` endpoint for cross-encoder reranker models.

## The Problem

Cross-encoder reranker models (like BGE-reranker-v2-m3) are designed to score query-document pairs for relevance. However:

- **Ollama has no `/api/rerank` endpoint** - reranker models can't be used as intended
- **`/api/embeddings`** - returns embeddings, not classification scores
- **`/api/generate`** - reranker models can't generate text (they output uniform scores like 0.5)

## The Workaround

This service uses a magnitude-based approach:

1. Concatenates query and document in cross-encoder format: `"Query: {query}\n\nDocument: {doc}\n\nRelevance:"`
2. Gets embedding vector from Ollama's `/api/embeddings` endpoint
3. Calculates the L2 norm (magnitude) of the embedding vector
4. **Key discovery:** For BGE-reranker-v2-m3, **lower magnitude = more relevant**
5. Inverts and normalizes to 0-1 range where higher score = more relevant
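
The five steps above can be sketched end to end as follows. This is a minimal sketch, not the service's actual code: `embed` is a hypothetical callable standing in for the request to Ollama's `/api/embeddings`, and the default bounds of 15.0 and 25.0 are the BGE-reranker-v2-m3 values quoted later in this README.

```python
import math
from typing import Callable, List, Sequence


def build_prompt(query: str, doc: str) -> str:
    """Step 1: concatenate query and document in the cross-encoder format."""
    return f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"


def l2_norm(vec: Sequence[float]) -> float:
    """Step 3: Euclidean length of the embedding vector."""
    return math.sqrt(sum(x * x for x in vec))


def magnitude_score(
    query: str,
    doc: str,
    embed: Callable[[str], List[float]],  # hypothetical stand-in for step 2
    good: float = 15.0,  # assumed magnitude of a highly relevant pair
    poor: float = 25.0,  # assumed magnitude of an irrelevant pair
) -> float:
    magnitude = l2_norm(embed(build_prompt(query, doc)))
    # Steps 4-5: lower magnitude means more relevant, so invert and clamp.
    score = (poor - magnitude) / (poor - good)
    return min(max(score, 0.0), 1.0)
```

Passing the embedding function in as a parameter keeps the scoring logic testable without a running Ollama instance.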

### Why This Works (Sort Of)

When a cross-encoder model processes a query-document pair through the embedding endpoint, the embedding's magnitude appears to correlate with relevance for some models. This is:
- **Not documented behavior**
- **Not guaranteed across models**
- **Not the intended use of the embedding endpoint**
- **Less accurate than proper cross-encoder scoring**

But it's the only way to use reranker models with Ollama right now.

## Limitations

### ⚠️ Critical Limitations

1. **Model-Specific Behavior**
   - Magnitude ranges differ between models (BGE: 15-28, others: unknown)
   - Correlation direction may vary (lower/higher = more relevant)
   - Requires manual calibration per model

2. **No Theoretical Foundation**
   - Exploits accidental behavior, not designed functionality
   - Could break with model updates
   - No guarantee of correctness

3. **Less Accurate Than Proper Methods**
   - Native cross-encoder scoring is more accurate
   - sentence-transformers library is the gold standard
   - This is a compromise for GPU scheduling benefits

4. **Embedding Dimension Dependency**
   - Magnitude scales with dimensionality (384 vs 768 vs 1024)
   - Models with different dimensions need different calibration

5. **Performance**
   - Requires one API call per document (40 docs = 40 calls)
   - Slower than native reranking would be
   - Usable for small candidate sets, but not optimal
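
To make limitation 4 concrete: if every component of an embedding has roughly the same typical size, the L2 norm grows with the square root of the dimension. The sketch below uses constant vectors purely for illustration; real embeddings are not constant, and the component value of 0.8 is an arbitrary assumption.

```python
import math
from typing import Sequence


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def norm_for_dimension(dim: int, component: float = 0.8) -> float:
    """Norm of a vector whose `dim` components all equal `component`."""
    return l2_norm([component] * dim)


if __name__ == "__main__":
    # Same per-component size, different dimensionality.
    for dim in (384, 768, 1024):
        print(dim, round(norm_for_dimension(dim), 2))
```

Quadrupling the dimension doubles the norm in this toy setting, which is why magnitude bounds calibrated for one embedding size should not be reused for another.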

## When To Use This

✅ **Use if:**
- You need Ollama's GPU scheduling for multiple models
- VRAM is constrained and you can't run separate services
- You're okay with reduced accuracy vs sentence-transformers
- You can tolerate model-specific calibration

❌ **Don't use if:**
- You need reliable, production-grade reranking
- You need cross-model consistency
- You have VRAM for sentence-transformers (~200MB for reranker only)
- Accuracy is critical

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/ollama-reranker-workaround.git
cd ollama-reranker-workaround

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Ensure Ollama is running with a reranker model
ollama pull qllama/bge-reranker-v2-m3
```

## Usage

### Start the Service

```bash
python api.py
```

The service runs on `http://0.0.0.0:8080`

### API Request

```bash
curl -X POST http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qllama/bge-reranker-v2-m3:latest",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence.",
      "The weather today is sunny.",
      "Neural networks are used in deep learning."
    ],
    "top_n": 2
  }'
```

### Response

```json
{
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9234,
      "document": "Machine learning is a subset of artificial intelligence."
    },
    {
      "index": 2,
      "relevance_score": 0.7845,
      "document": "Neural networks are used in deep learning."
    }
  ]
}
```
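
A minimal Python client for the endpoint might look like the sketch below. It uses only the standard library. The helper names `build_payload`, `rerank`, and `ranked_documents` are illustrative and not part of this service, and the base URL in the usage note assumes the default port shown above.

```python
import json
import urllib.request
from typing import Any, Dict, List, Tuple


def build_payload(model: str, query: str, documents: List[str], top_n: int = 3) -> bytes:
    """Encode the request body in the shape shown under 'API Request'."""
    body = {"model": model, "query": query, "documents": documents, "top_n": top_n}
    return json.dumps(body).encode("utf-8")


def rerank(base_url: str, payload: bytes, timeout: float = 30.0) -> Dict[str, Any]:
    """POST the payload to /v1/rerank and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def ranked_documents(response: Dict[str, Any]) -> List[Tuple[int, float]]:
    """Extract (original index, relevance score) pairs in ranked order."""
    return [(r["index"], r["relevance_score"]) for r in response["results"]]
```

With the service running, `ranked_documents(rerank("http://localhost:8080", build_payload(model, query, docs)))` would yield the ranked indices; `index` refers to the position in the submitted `documents` list, not the rank.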

## Configuration & Tunables

### Model Calibration

The most critical parameters are in `score_document_cross_encoder_workaround()`:

```python
# Magnitude bounds (model-specific!)
typical_good_magnitude = 15.0  # Highly relevant documents
typical_poor_magnitude = 25.0  # Irrelevant documents

# For BGE-reranker-v2-m3, observed range is ~15-28
# Lower magnitude = more relevant (inverted correlation)
```

### How to Calibrate for a New Model

1. **Enable magnitude logging:**
   ```python
   logger.info(f"Raw magnitude: {magnitude:.2f}")
   ```

2. **Test with known relevant/irrelevant documents:**
   ```python
   # Send queries with obviously relevant and irrelevant docs
   # Observe magnitude ranges in logs
   ```

3. **Determine correlation direction:**
   - If relevant docs have **lower** magnitudes → set `invert = True` (this is the behavior `api.py` currently hard-codes)
   - If relevant docs have **higher** magnitudes → set `invert = False` (`api.py` has no such flag yet, so the normalization formula has to be flipped by hand)

4. **Set bounds:**
   ```python
   # Find 90th percentile of relevant doc magnitudes
   typical_good_magnitude = <observed_value>

   # Find 10th percentile of irrelevant doc magnitudes
   typical_poor_magnitude = <observed_value>
   ```
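
Steps 3 and 4 can be automated once magnitudes have been collected from the logs. The following is a sketch under assumptions: `percentile` and `calibrate` are hypothetical helpers that do not exist in `api.py`, and the mirrored percentiles used for the non-inverted case are a guess by symmetry rather than something this README specifies.

```python
from typing import List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def calibrate(relevant: List[float], irrelevant: List[float]) -> Tuple[float, float, bool]:
    """Return (typical_good_magnitude, typical_poor_magnitude, invert)."""
    mean_relevant = sum(relevant) / len(relevant)
    mean_irrelevant = sum(irrelevant) / len(irrelevant)
    # invert=True means lower magnitude = more relevant (the BGE case).
    invert = mean_relevant < mean_irrelevant
    if invert:
        good = percentile(relevant, 90)
        poor = percentile(irrelevant, 10)
    else:
        # Assumption: mirror the percentiles when the direction flips.
        good = percentile(relevant, 10)
        poor = percentile(irrelevant, 90)
    return good, poor, invert
```

If the two bounds come out very close together, or in the wrong order for the detected direction, the magnitude signal is probably too weak for that model to be useful.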

### Prompt Format Tuning

The concatenation format may affect results:

```python
# Current format (works for BGE-reranker-v2-m3)
combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"

# Alternative formats to try:
combined = f"{query} [SEP] {doc}"
combined = f"query: {query} document: {doc}"
combined = f"<query>{query}</query><document>{doc}</document>"
```

Test different formats and check if score distributions improve.
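
One way to compare formats, sketched under assumptions: for each candidate template, embed a few known-relevant and known-irrelevant pairs and measure how far apart the two groups' mean magnitudes are. Here `embed` is a hypothetical callable standing in for the Ollama request, templates are plain `str.format` strings with `{query}` and `{doc}` placeholders, and `separation` is an illustrative metric that this service does not compute.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (query, document)
Embed = Callable[[str], List[float]]


def l2_norm(vec: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))


def mean_magnitude(template: str, pairs: List[Pair], embed: Embed) -> float:
    """Average embedding magnitude over the formatted query-document pairs."""
    norms = [l2_norm(embed(template.format(query=q, doc=d))) for q, d in pairs]
    return sum(norms) / len(norms)


def separation(template: str, relevant: List[Pair], irrelevant: List[Pair], embed: Embed) -> float:
    """Mean irrelevant magnitude minus mean relevant magnitude.

    Larger positive values mean the template separates the two groups
    better, assuming lower magnitude indicates relevance.
    """
    return mean_magnitude(template, irrelevant, embed) - mean_magnitude(template, relevant, embed)


def best_template(templates: List[str], relevant: List[Pair], irrelevant: List[Pair], embed: Embed) -> str:
    """Pick the template with the widest gap between the two groups."""
    gaps: Dict[str, float] = {t: separation(t, relevant, irrelevant, embed) for t in templates}
    return max(gaps, key=gaps.get)
```

A difference of means is a crude measure; with enough labelled pairs, comparing the overlap of the two magnitude ranges would be more informative.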

### Concurrency Settings

```python
# In the rerank() endpoint
# Process all documents concurrently (default)
tasks = [score_document(...) for doc in documents]
results = await asyncio.gather(*tasks)

# Or batch for rate limiting:
batch_size = 10
for i in range(0, len(documents), batch_size):
    batch = documents[i:i+batch_size]
    # process batch...
```
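
An alternative to fixed batches is to cap the number of in-flight requests with a semaphore, so a slow document does not hold up a whole batch. This is a sketch, not code from `api.py`: `score_with_limit` and its `score_one` parameter are hypothetical, with `score_one` standing in for a wrapper around `score_document_cross_encoder_workaround()`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def score_with_limit(
    documents: List[str],
    score_one: Callable[[str, int], Awaitable[dict]],  # hypothetical scorer
    max_in_flight: int = 10,
) -> List[dict]:
    """Score every document with at most `max_in_flight` concurrent calls."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(doc: str, index: int) -> dict:
        # Wait for a free slot before issuing the request.
        async with semaphore:
            return await score_one(doc, index)

    # gather preserves input order, so results[i] belongs to documents[i].
    return await asyncio.gather(*(guarded(d, i) for i, d in enumerate(documents)))
```

All tasks are still created up front; only the number of simultaneous calls to the scorer is limited.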

## Technical Details

### Magnitude Calculation

```python
import numpy as np

# Get embedding from Ollama
embedding = await get_embedding(client, model, combined_text)

# Calculate L2 norm (Euclidean length)
vec = np.array(embedding)
magnitude = float(np.linalg.norm(vec))
# magnitude = sqrt(sum(x_i^2 for all dimensions))
```

### Score Normalization

```python
# Linear interpolation (inverted for BGE-reranker-v2-m3)
score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)

# Clamp to [0, 1]
score = min(max(score, 0.0), 1.0)
```
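
A worked example of the normalization, written as a standalone function. `normalize` is a hypothetical helper (in `api.py` the formula is inline), and the defaults are the BGE-reranker-v2-m3 bounds from above. The same formula also covers a model whose relevant documents have higher magnitudes: pass a `good` bound that is larger than `poor`.

```python
def normalize(magnitude: float, good: float = 15.0, poor: float = 25.0) -> float:
    """Map a raw magnitude to [0, 1], where 1.0 means most relevant."""
    if good == poor:
        raise ValueError("good and poor bounds must differ")
    score = (poor - magnitude) / (poor - good)
    return min(max(score, 0.0), 1.0)


if __name__ == "__main__":
    for m in (15.0, 17.5, 20.0, 25.0, 28.0):
        print(f"magnitude {m:4.1f} -> score {normalize(m):.2f}")
    # magnitude 15.0 -> score 1.00
    # magnitude 17.5 -> score 0.75
    # magnitude 20.0 -> score 0.50
    # magnitude 25.0 -> score 0.00
    # magnitude 28.0 -> score 0.00  (clamped)
```

With these bounds, every magnitude at or above 25 collapses to a score of 0.0, so documents in the 25-28 range cannot be distinguished from one another.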

### Example Magnitude Distributions

From real queries to BGE-reranker-v2-m3:

```
Query: "Was ist eine Catalog Node ID?"  (German: "What is a Catalog Node ID?")

Highly relevant docs:   magnitude ~15.30 - 15.98  →  score 0.95-0.97
Moderately relevant:    magnitude ~17.00 - 19.00  →  score 0.70-0.85
Weakly relevant:        magnitude ~20.00 - 24.00  →  score 0.20-0.50
Irrelevant:             magnitude ~25.00 - 28.00  →  score 0.00-0.10
```

## Alternatives

### 1. Use sentence-transformers (Recommended for Production)

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('BAAI/bge-reranker-v2-m3', device='cpu')
scores = model.predict([(query, doc) for doc in documents])
```

**Pros:** Accurate, reliable, proper implementation
**Cons:** ~200MB VRAM/RAM, separate from Ollama

### 2. Request Ollama Feature

Open an issue on [Ollama's GitHub](https://github.com/ollama/ollama) requesting native `/api/rerank` support.

### 3. Use API Services

Services like Cohere, Jina AI, or Voyage AI offer reranking APIs.

## Requirements

```
fastapi>=0.104.0
uvicorn>=0.24.0
httpx>=0.25.0
pydantic>=2.0.0
numpy>=1.24.0
```

## Contributing

This is a workaround for a missing feature. Contributions welcome for:
- Calibration configs for additional models
- Auto-calibration logic
- Alternative prompt formats
- Better normalization strategies

But remember: **The best contribution would be native Ollama support.**

## License

MIT

## Disclaimer

This is an **experimental workaround** that exploits undocumented behavior. It is:
- Not endorsed by Ollama or BAAI
- Not guaranteed to work across models or versions
- Not suitable for production use without extensive testing
- A temporary solution until native reranking support exists

**Use at your own risk and always validate results against ground truth.**
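
One simple way to validate against ground truth is to compare the workaround's top-k with a trusted ranking of the same documents (for example, one produced with sentence-transformers). This sketch assumes both rankings are lists of document indices, best first; `top_k_overlap` is an illustrative helper and not part of the service.

```python
from typing import List


def top_k_overlap(candidate: List[int], reference: List[int], k: int) -> float:
    """Fraction of the reference top-k that also appears in the candidate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    reference_top = set(reference[:k])
    if not reference_top:
        return 0.0
    return len(reference_top & set(candidate[:k])) / len(reference_top)
```

An overlap that stays low across many queries suggests the magnitude signal is not tracking relevance for that model or prompt format.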

plugins/reranking-endpoint/api.py
@@ -0,0 +1,186 @@
import asyncio
import httpx
import logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import numpy as np

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

app = FastAPI(title="Ollama BGE Reranker (Working Workaround)")

class RerankRequest(BaseModel):
    model: str
    query: str
    documents: List[str]
    top_n: Optional[int] = 3

class RerankResult(BaseModel):
    index: int
    relevance_score: float
    document: Optional[str] = None

class RerankResponse(BaseModel):
    results: List[RerankResult]

async def get_embedding(
    client: httpx.AsyncClient,
    model: str,
    text: str
) -> Optional[List[float]]:
    """Get embedding from Ollama."""
    url = "http://localhost:11434/api/embeddings"

    try:
        response = await client.post(
            url,
            json={"model": model, "prompt": text},
            timeout=30.0
        )
        response.raise_for_status()
        return response.json().get("embedding")
    except Exception as e:
        logger.error(f"Error getting embedding: {e}")
        return None

async def score_document_cross_encoder_workaround(
    client: httpx.AsyncClient,
    model: str,
    query: str,
    doc: str,
    index: int
) -> dict:
    """
    Workaround for using BGE-reranker with Ollama.
    Based on: https://medium.com/@rosgluk/reranking-documents-with-ollama-and-qwen3-reranker-model-in-go-6dc9c2fb5f0b

    Key discovery: When using concatenated query+doc embeddings,
    LOWER magnitude = MORE relevant. We invert the scores so that
    higher values = more relevant (standard convention).

    Steps:
    1. Concatenate query and document in cross-encoder format
    2. Get embedding of the concatenated text
    3. Calculate magnitude (lower = more relevant)
    4. Invert and normalize to 0-1 (higher = more relevant)
    """

    # Format as cross-encoder input
    # The format matters - reranker models expect specific patterns
    combined = f"Query: {query}\n\nDocument: {doc}\n\nRelevance:"

    # Get embedding
    embedding = await get_embedding(client, model, combined)

    if embedding is None:
        logger.warning(f"Failed to get embedding for document {index}")
        return {
            "index": index,
            "relevance_score": 0.0,
            "document": doc
        }

    # Calculate magnitude (L2 norm) of the embedding vector
    vec = np.array(embedding)
    magnitude = float(np.linalg.norm(vec))

    # CRITICAL DISCOVERY: For BGE-reranker via Ollama embeddings:
    # LOWER magnitude = MORE relevant document
    # Observed range: ~15-28 (lower = better); anything above 25 clamps to 0

    # Invert and normalize to 0-1 where higher score = more relevant
    # Adjusted bounds based on empirical observations
    typical_good_magnitude = 15.0  # Highly relevant documents
    typical_poor_magnitude = 25.0  # Irrelevant documents

    # Linear interpolation (inverted)
    # magnitude 15 → score 1.0
    # magnitude 25 → score 0.0
    score = (typical_poor_magnitude - magnitude) / (typical_poor_magnitude - typical_good_magnitude)

    # Clamp to 0-1 range
    score = min(max(score, 0.0), 1.0)

    logger.debug(f"Doc {index}: magnitude={magnitude:.2f}, score={score:.4f}")
    logger.info(f"Raw magnitude: {magnitude:.2f}")

    return {
        "index": index,
        "relevance_score": score,
        "document": doc
    }

@app.on_event("startup")
async def check_ollama():
    """Verify Ollama is accessible on startup."""
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get("http://localhost:11434/api/tags", timeout=5.0)
            response.raise_for_status()
        logger.info("✓ Successfully connected to Ollama")
        logger.warning("⚠️ Using workaround: concatenation + magnitude")
        logger.warning("⚠️ This is less accurate than proper cross-encoder usage")
    except Exception as e:
        logger.error(f"✗ Cannot connect to Ollama: {e}")

@app.post("/v1/rerank", response_model=RerankResponse)
async def rerank(request: RerankRequest):
    """
    Rerank documents using BGE-reranker via Ollama workaround.

    NOTE: This uses a workaround (magnitude of concatenated embeddings)
    because Ollama doesn't expose BGE's classification head.
    For best accuracy, use sentence-transformers directly.
    """
    if not request.documents:
        raise HTTPException(status_code=400, detail="No documents provided")

    logger.info(f"Reranking {len(request.documents)} documents (workaround method)")
    logger.info(f"Query: {request.query[:100]}...")

    async with httpx.AsyncClient() as client:
        # Score all documents concurrently
        tasks = [
            score_document_cross_encoder_workaround(
                client, request.model, request.query, doc, i
            )
            for i, doc in enumerate(request.documents)
        ]
        results = await asyncio.gather(*tasks)

    # Sort by score DESCENDING (higher score = more relevant)
    # Scores are now inverted, so higher = better
    results.sort(key=lambda x: x["relevance_score"], reverse=True)

    # Log scores
    top_scores = [f"{r['relevance_score']:.4f}" for r in results[:request.top_n]]
    logger.info(f"Top {len(top_scores)} scores: {top_scores}")

    return {"results": results[:request.top_n]}

@app.get("/health")
def health_check():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "service": "ollama-bge-reranker-workaround",
        "note": "Using magnitude workaround - less accurate than native"
    }

if __name__ == "__main__":
    import uvicorn

    logger.info("=" * 60)
    logger.info("Ollama BGE Reranker - WORKAROUND Implementation")
    logger.info("=" * 60)
    logger.info("Using concatenation + magnitude method")
    logger.info("This works but is less accurate than proper cross-encoders")
    logger.info("Starting on: http://0.0.0.0:8080")
    logger.info("=" * 60)

    uvicorn.run(app, host="0.0.0.0", port=8080, log_level="info")

plugins/reranking-endpoint/requirements.txt
@@ -0,0 +1,5 @@
fastapi
uvicorn
httpx
pydantic
numpy