improvements

2026-01-16 12:48:56 +01:00
parent 514bd9b571
commit 345aa419c7
9 changed files with 3966 additions and 204 deletions

README.md

@@ -34,6 +34,11 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks
- Category-wise performance breakdown
- Difficulty-based analysis
- CSV export for further analysis
- **🌐 Interactive Web Dashboard** (New!)
  - Visual analytics with charts and graphs
  - Advanced intelligence metrics
  - Filtering, sorting, and statistical analysis
  - Multi-dimensional performance evaluation
## Quick Start
@@ -41,25 +46,82 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks
```bash
# Python 3.8+
pip install -r requirements.txt
# or manually:
pip install pyyaml requests python-dotenv
```
### Installation
```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
# Copy the example environment file
cp .env.example .env
# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```
### Configuration with .env File (Recommended)
The test suite can be configured using a `.env` file for easier batch testing and non-interactive mode:
```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY= # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M
# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY= # Optional
EVALUATOR_MODEL=qwen3:14b # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3 # Lower = more consistent scoring
# Execution Mode
NON_INTERACTIVE=false # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY= # Optional: filter by category
```
### Basic Usage
#### 0. Test Connectivity (Dry Run)
Before running the full test suite, verify that your API endpoints are reachable and properly configured:
```bash
# For Ollama (default: http://localhost:11434)
# Test MUT endpoint connectivity
python ai_eval.py --dry-run
# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run
# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run
# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```
The dry-run mode will:
- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test evaluator endpoint if in non-interactive mode
- Exit with success/failure status
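The checks above are implemented inside `ai_eval.py`. As a rough sketch of what a connectivity probe against a local endpoint can look like (the paths `/v1/models` and `/api/tags` are assumptions about OpenAI-compatible and Ollama backends, and `check_endpoint` is a hypothetical helper, not the script's actual function):
```python
import requests

def check_endpoint(endpoint: str, api_key: str = "", timeout: int = 10) -> bool:
    """Hypothetical connectivity probe: try the OpenAI-compatible model listing,
    then fall back to Ollama's tag list. Not the actual ai_eval.py logic."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    for path in ("/v1/models", "/api/tags"):
        try:
            resp = requests.get(endpoint.rstrip("/") + path,
                                headers=headers, timeout=timeout)
            if resp.ok:
                return True
        except requests.RequestException:
            continue
    return False

if __name__ == "__main__":
    ok = check_endpoint("http://localhost:11434")
    print("endpoint reachable" if ok else "endpoint unreachable")
```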
#### 1. Interactive Mode (Manual Evaluation)
```bash
# Using .env file
python ai_eval.py
# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
# For other endpoints with API key
@@ -69,33 +131,94 @@ python ai_eval.py \
  --model your-model-name
```
#### 2. Non-Interactive Mode (Automated Evaluation)
Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.
```bash
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b
# Run the test
python ai_eval.py
# Or with command-line arguments
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b-q4_K_M \
  --non-interactive \
  --evaluator-endpoint http://localhost:11434 \
  --evaluator-model qwen3:14b
```
**How Non-Interactive Mode Works:**
- For each test, the script sends the original prompt, model response, and evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation
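A minimal sketch of what such an evaluator call can look like, assuming an OpenAI-compatible `/v1/chat/completions` endpoint; the exact payload, system prompt wording, and parsing in `ai_eval.py` may differ, and `evaluate_response` here is purely illustrative:
```python
import json
import requests

EVALUATOR_SYSTEM_PROMPT = (
    "You are an objective grader. Score the response from 0 to 5 against the "
    'criteria and reply as JSON: {"score": <0-5>, "notes": "..."}.'
)  # placeholder wording; the real system prompt lives in ai_eval.py

def evaluate_response(endpoint, model, prompt, response, criteria,
                      api_key="", temperature=0.3):
    """Hypothetical helper: ask the evaluator model to score one test case."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": EVALUATOR_SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Original prompt:\n{prompt}\n\n"
                f"Model response:\n{response}\n\n"
                f"Evaluation criteria:\n{criteria}"
            )},
        ],
    }
    resp = requests.post(f"{endpoint.rstrip('/')}/v1/chat/completions",
                         headers=headers, json=payload, timeout=120)
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    # Parsing strict JSON is one design choice; the actual script may extract
    # the score differently.
    return json.loads(content)  # e.g. {"score": 4, "notes": "..."}
```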
**Choosing an Evaluator Model:**
- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring
#### 3. Test Multiple Models (Batch Mode)
Test multiple models in one run by specifying comma-separated model names:
```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Run batch test
python ai_eval.py
# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```
The script will automatically test each model sequentially and save individual results.
#### 4. Filter by Category
```bash
# Test only IT Forensics categories
python ai_eval.py --category "IT Forensics - File Systems"
```
#### 5. Analyze Results
See the **Analyzing Results** section below for the web dashboard and command-line comparison tools.
## Analyzing Results
### Interactive Web Dashboard (Recommended)
Launch the comprehensive web interface for visual analysis:
```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web
# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```
**Features:**
- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation
See [WEB_INTERFACE.md](WEB_INTERFACE.md) for detailed documentation.
### Command-Line Analysis
```bash
# Compare all models
python analyze_results.py --compare
# Detailed report for specific model
@@ -188,6 +311,7 @@ All tests are evaluated on a 0-5 scale:
├── ai_eval.py            # Main testing script
├── analyze_results.py    # Results analysis and comparison
├── test_suite.yaml       # Test definitions
├── .env.example          # Configuration template
├── results/              # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
@@ -195,6 +319,60 @@ All tests are evaluated on a 0-5 scale:
└── README.md
```
## Configuration Reference
### Environment Variables (.env file)
All configuration can be set via the `.env` file or command-line arguments; command-line values override those from `.env` (see the sketch after the argument list below).
#### Model Under Test (MUT)
| Variable | Description | Example |
| --- | --- | --- |
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |
#### Evaluator Configuration (for Non-Interactive Mode)
| Variable | Description | Example |
| --- | --- | --- |
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |
#### Test Configuration
| Variable | Description | Example |
| --- | --- | --- |
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |
### Command-Line Arguments
All environment variables have corresponding command-line flags:
```bash
python ai_eval.py --help
Options:
  --endpoint ENDPOINT              Model under test endpoint
  --api-key API_KEY                Model under test API key
  --model MODEL                    Model name to test
  --test-suite FILE                Test suite YAML file
  --output-dir DIR                 Output directory
  --category CATEGORY              Filter by category
  --non-interactive                Enable automated evaluation
  --evaluator-endpoint ENDPOINT    Evaluator API endpoint
  --evaluator-api-key KEY          Evaluator API key
  --evaluator-model MODEL          Evaluator model name
  --evaluator-temperature TEMP     Evaluator temperature
```
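As noted above, command-line arguments take precedence over `.env` values. A minimal sketch of one way to wire up that precedence with python-dotenv and argparse (illustrative only; not necessarily how `ai_eval.py` implements it):
```python
import argparse
import os

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env into os.environ without overwriting existing variables

parser = argparse.ArgumentParser(description="Illustrative config precedence")
# .env values become defaults; passing a flag on the command line overrides them.
parser.add_argument("--endpoint",
                    default=os.getenv("MUT_ENDPOINT", "http://localhost:11434"))
parser.add_argument("--model", default=os.getenv("MUT_MODEL", ""))
parser.add_argument("--non-interactive", action="store_true",
                    default=os.getenv("NON_INTERACTIVE", "false").lower() == "true")
args = parser.parse_args()

print(f"endpoint={args.endpoint} model={args.model} "
      f"non_interactive={args.non_interactive}")
```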
## Advanced Usage
### Custom Test Suite
@@ -214,28 +392,25 @@ Edit `test_suite.yaml` to add your own tests:
expected_difficulty: "medium" # medium, hard, very_hard
```
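For orientation, loading and filtering such a suite is straightforward with PyYAML. A rough sketch, assuming the file holds a list of test entries that each carry a `category` field; the actual schema is whatever `test_suite.yaml` defines, and `load_tests` is a hypothetical helper:
```python
import yaml

def load_tests(path="test_suite.yaml", category=None):
    """Illustrative loader; the real structure of test_suite.yaml may differ."""
    with open(path, "r", encoding="utf-8") as fh:
        suite = yaml.safe_load(fh)
    # Accept either a top-level list or a mapping with a "tests" key.
    tests = suite.get("tests", suite) if isinstance(suite, dict) else suite
    if category:
        tests = [t for t in tests if t.get("category") == category]
    return tests

print(len(load_tests(category="IT Forensics - File Systems")), "tests selected")
```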
### Batch Testing Examples
Testing multiple models using the `.env` configuration:
```bash
# Configure .env with multiple models
cp .env.example .env
nano .env
# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Run batch tests
python ai_eval.py
# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M
# Generate comparison after testing
python analyze_results.py --compare
```
@@ -244,8 +419,8 @@ python analyze_results.py --compare
For OpenAI-compatible cloud services:
```bash
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```