improvements

2026-01-16 12:48:56 +01:00
parent 514bd9b571
commit 345aa419c7
9 changed files with 3966 additions and 204 deletions

README.md

@@ -34,6 +34,11 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks
- Category-wise performance breakdown
- Difficulty-based analysis
- CSV export for further analysis
- **🌐 Interactive Web Dashboard** (New!)
  - Visual analytics with charts and graphs
  - Advanced intelligence metrics
  - Filtering, sorting, and statistical analysis
  - Multi-dimensional performance evaluation
## Quick Start
@@ -41,25 +46,82 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks
```bash
# Python 3.8+
pip install -r requirements.txt
# or manually:
pip install pyyaml requests python-dotenv
```
### Installation
```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
# Copy the example environment file
cp .env.example .env
# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```
### Configuration with .env File (Recommended)
The test suite can be configured using a `.env` file for easier batch testing and non-interactive mode:
```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY= # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M
# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY= # Optional
EVALUATOR_MODEL=qwen3:14b # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3 # Lower = more consistent scoring
# Execution Mode
NON_INTERACTIVE=false # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY= # Optional: filter by category
```
### Basic Usage
#### 0. Test Connectivity (Dry Run)
Before running the full test suite, verify that your API endpoints are reachable and properly configured:
```bash
# For Ollama (default: http://localhost:11434)
# Test MUT endpoint connectivity
python ai_eval.py --dry-run
# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run
# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run
# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```
The dry-run mode will:
- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test evaluator endpoint if in non-interactive mode
- Exit with success/failure status
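The checks above are implemented inside `ai_eval.py`. As a rough sketch of what a connectivity probe against a local endpoint can look like (the paths `/v1/models` and `/api/tags` are assumptions about OpenAI-compatible and Ollama backends, and `check_endpoint` is a hypothetical helper, not the script's actual function):
```python
import requests

def check_endpoint(endpoint: str, api_key: str = "", timeout: int = 10) -> bool:
    """Hypothetical connectivity probe: try the OpenAI-compatible model listing,
    then fall back to Ollama's tag list. Not the actual ai_eval.py logic."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    for path in ("/v1/models", "/api/tags"):
        try:
            resp = requests.get(endpoint.rstrip("/") + path,
                                headers=headers, timeout=timeout)
            if resp.ok:
                return True
        except requests.RequestException:
            continue
    return False

if __name__ == "__main__":
    ok = check_endpoint("http://localhost:11434")
    print("endpoint reachable" if ok else "endpoint unreachable")
```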
#### 1. Interactive Mode (Manual Evaluation)
```bash
# Using .env file
python ai_eval.py
# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
# For other endpoints with API key
@@ -69,33 +131,94 @@ python ai_eval.py \
  --model your-model-name
```
#### 2. Non-Interactive Mode (Automated Evaluation)
Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.
```bash
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b
# Run the test
python ai_eval.py
# Or with command-line arguments
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b-q4_K_M \
  --non-interactive \
  --evaluator-endpoint http://localhost:11434 \
  --evaluator-model qwen3:14b
```
**How Non-Interactive Mode Works:**
- For each test, the script sends the original prompt, model response, and evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation
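A minimal sketch of what such an evaluator call can look like, assuming an OpenAI-compatible `/v1/chat/completions` endpoint; the exact payload, system prompt wording, and parsing in `ai_eval.py` may differ, and `evaluate_response` here is purely illustrative:
```python
import json
import requests

EVALUATOR_SYSTEM_PROMPT = (
    "You are an objective grader. Score the response from 0 to 5 against the "
    'criteria and reply as JSON: {"score": <0-5>, "notes": "..."}.'
)  # placeholder wording; the real system prompt lives in ai_eval.py

def evaluate_response(endpoint, model, prompt, response, criteria,
                      api_key="", temperature=0.3):
    """Hypothetical helper: ask the evaluator model to score one test case."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": EVALUATOR_SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Original prompt:\n{prompt}\n\n"
                f"Model response:\n{response}\n\n"
                f"Evaluation criteria:\n{criteria}"
            )},
        ],
    }
    resp = requests.post(f"{endpoint.rstrip('/')}/v1/chat/completions",
                         headers=headers, json=payload, timeout=120)
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    # Parsing strict JSON is one design choice; the actual script may extract
    # the score differently.
    return json.loads(content)  # e.g. {"score": 4, "notes": "..."}
```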
**Choosing an Evaluator Model:**
- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring
#### 3. Test Multiple Models (Batch Mode)
Test multiple models in one run by specifying comma-separated model names:
```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Run batch test
python ai_eval.py
# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```
The script will automatically test each model sequentially and save individual results.
#### 4. Filter by Category
```bash
# Test only IT Forensics categories
python ai_eval.py --category "IT Forensics - File Systems"
```
#### 5. Analyze Results
See the **Analyzing Results** section below for the web dashboard and command-line comparison tools.
## Analyzing Results
### Interactive Web Dashboard (Recommended)
Launch the comprehensive web interface for visual analysis:
```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web
# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```
**Features:**
- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation
See [WEB_INTERFACE.md](WEB_INTERFACE.md) for detailed documentation.
### Command-Line Analysis
```bash
# Compare all models
python analyze_results.py --compare
# Detailed report for specific model
@@ -188,6 +311,7 @@ All tests are evaluated on a 0-5 scale:
├── ai_eval.py            # Main testing script
├── analyze_results.py    # Results analysis and comparison
├── test_suite.yaml       # Test definitions
├── .env.example          # Configuration template
├── results/              # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
@@ -195,6 +319,60 @@ All tests are evaluated on a 0-5 scale:
└── README.md
```
## Configuration Reference
### Environment Variables (.env file)
All configuration can be set via the `.env` file or command-line arguments; command-line values override those from `.env` (see the sketch after the argument list below).
#### Model Under Test (MUT)
| Variable | Description | Example |
| --- | --- | --- |
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |
#### Evaluator Configuration (for Non-Interactive Mode)
| Variable | Description | Example |
| --- | --- | --- |
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |
#### Test Configuration
| Variable | Description | Example |
| --- | --- | --- |
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |
### Command-Line Arguments
All environment variables have corresponding command-line flags:
```bash
python ai_eval.py --help
Options:
  --endpoint ENDPOINT              Model under test endpoint
  --api-key API_KEY                Model under test API key
  --model MODEL                    Model name to test
  --test-suite FILE                Test suite YAML file
  --output-dir DIR                 Output directory
  --category CATEGORY              Filter by category
  --non-interactive                Enable automated evaluation
  --evaluator-endpoint ENDPOINT    Evaluator API endpoint
  --evaluator-api-key KEY          Evaluator API key
  --evaluator-model MODEL          Evaluator model name
  --evaluator-temperature TEMP     Evaluator temperature
```
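As noted above, command-line arguments take precedence over `.env` values. A minimal sketch of one way to wire up that precedence with python-dotenv and argparse (illustrative only; not necessarily how `ai_eval.py` implements it):
```python
import argparse
import os

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env into os.environ without overwriting existing variables

parser = argparse.ArgumentParser(description="Illustrative config precedence")
# .env values become defaults; passing a flag on the command line overrides them.
parser.add_argument("--endpoint",
                    default=os.getenv("MUT_ENDPOINT", "http://localhost:11434"))
parser.add_argument("--model", default=os.getenv("MUT_MODEL", ""))
parser.add_argument("--non-interactive", action="store_true",
                    default=os.getenv("NON_INTERACTIVE", "false").lower() == "true")
args = parser.parse_args()

print(f"endpoint={args.endpoint} model={args.model} "
      f"non_interactive={args.non_interactive}")
```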
## Advanced Usage
### Custom Test Suite
@@ -214,28 +392,25 @@ Edit `test_suite.yaml` to add your own tests:
expected_difficulty: "medium" # medium, hard, very_hard
```
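For orientation, loading and filtering such a suite is straightforward with PyYAML. A rough sketch, assuming the file holds a list of test entries that each carry a `category` field; the actual schema is whatever `test_suite.yaml` defines, and `load_tests` is a hypothetical helper:
```python
import yaml

def load_tests(path="test_suite.yaml", category=None):
    """Illustrative loader; the real structure of test_suite.yaml may differ."""
    with open(path, "r", encoding="utf-8") as fh:
        suite = yaml.safe_load(fh)
    # Accept either a top-level list or a mapping with a "tests" key.
    tests = suite.get("tests", suite) if isinstance(suite, dict) else suite
    if category:
        tests = [t for t in tests if t.get("category") == category]
    return tests

print(len(load_tests(category="IT Forensics - File Systems")), "tests selected")
```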
### Batch Testing Examples
Testing multiple models using the `.env` configuration:
```bash
# Configure .env with multiple models
cp .env.example .env
nano .env
# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Run batch tests
python ai_eval.py
# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M
# Generate comparison after testing
python analyze_results.py --compare
```
@@ -244,8 +419,8 @@ python analyze_results.py --compare
For OpenAI-compatible cloud services:
```bash
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```