Comprehensive testing suite for evaluating AI models on general reasoning tasks.

- Category-wise performance breakdown
- Difficulty-based analysis
- CSV export for further analysis
- **🌐 Interactive Web Dashboard** (New!)
  - Visual analytics with charts and graphs
  - Advanced intelligence metrics
  - Filtering, sorting, and statistical analysis
  - Multi-dimensional performance evaluation

## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install -r requirements.txt
# or manually:
pip install pyyaml requests python-dotenv
```

### Installation

```bash
# Clone or download the files
# Ensure these files are in your working directory:
#   - ai_eval.py
#   - analyze_results.py
#   - test_suite.yaml

# Copy the example environment file
cp .env.example .env

# Edit .env with your settings:
#   - Configure the model under test (MUT_*)
#   - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
#   - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```

### Configuration with .env File (Recommended)

The test suite can be configured using a `.env` file for easier batch testing and non-interactive mode:

```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY=                      # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M

# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY=                # Optional
EVALUATOR_MODEL=qwen3:14b         # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3         # Lower = more consistent scoring

# Execution Mode
NON_INTERACTIVE=false             # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY=                  # Optional: filter by category
```

### Basic Usage

#### 0. Test Connectivity (Dry Run)

Before running the full test suite, verify that your API endpoints are reachable and properly configured:

```bash
# For Ollama (default: http://localhost:11434)
# Test MUT endpoint connectivity
python ai_eval.py --dry-run

# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run

# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run

# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```

The dry-run mode will:

- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test evaluator endpoint if in non-interactive mode
- Exit with success/failure status
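
If you want the same kind of check in your own scripts, here is a minimal sketch of a connectivity probe, assuming an Ollama-style endpoint that lists models via `GET /api/tags`; the helper is illustrative and not part of `ai_eval.py`:

```python
import os
import requests

def probe_endpoint(endpoint: str, model: str, api_key: str = "") -> bool:
    """Illustrative dry-run-style probe: checks reachability and model availability."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    try:
        # Ollama lists installed models at GET /api/tags
        resp = requests.get(f"{endpoint}/api/tags", headers=headers, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"Endpoint unreachable: {exc}")
        return False
    available = [m["name"] for m in resp.json().get("models", [])]
    if model not in available:
        print(f"Model {model!r} not found; available: {available}")
        return False
    print(f"OK: {model} is available at {endpoint}")
    return True

if __name__ == "__main__":
    probe_endpoint(os.getenv("MUT_ENDPOINT", "http://localhost:11434"),
                   os.getenv("MUT_MODEL", "qwen3:4b-q4_K_M"))
```
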
#### 1. Interactive Mode (Manual Evaluation)

```bash
# Using .env file
python ai_eval.py

# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with API key
python ai_eval.py \
  --endpoint https://api.service.com \
  --api-key your-api-key \
  --model your-model-name
```

#### 2. Non-Interactive Mode (Automated Evaluation)

Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.

```bash
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b

# Run the test
python ai_eval.py

# Or with command-line arguments
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b-q4_K_M \
  --non-interactive \
  --evaluator-endpoint http://localhost:11434 \
  --evaluator-model qwen3:14b
```

**How Non-Interactive Mode Works:**

- For each test, the script sends the original prompt, the model's response, and the evaluation criteria to the evaluator API (see the sketch after this list)
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation
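
As a concrete illustration, here is a minimal sketch of that exchange, assuming the evaluator endpoint exposes an OpenAI-compatible `/v1/chat/completions` route (Ollama does); the system prompt and response format are placeholders, not the ones actually used by `ai_eval.py`:

```python
import requests

EVALUATOR_ENDPOINT = "http://localhost:11434"  # assumed OpenAI-compatible (e.g., Ollama)
EVALUATOR_MODEL = "qwen3:14b"

# Placeholder system prompt; the real one in ai_eval.py may differ.
SYSTEM_PROMPT = ("You are an impartial evaluator. Score the response from 0 to 5 "
                 "against the criteria. Reply as: SCORE: <n> NOTES: <one sentence>.")

def score_response(prompt: str, response: str, criteria: str) -> str:
    """Send prompt, response, and criteria to the evaluator and return its verdict."""
    payload = {
        "model": EVALUATOR_MODEL,
        "temperature": 0.3,  # lower temperature = more consistent scoring
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": (f"Prompt:\n{prompt}\n\n"
                                         f"Response:\n{response}\n\n"
                                         f"Criteria:\n{criteria}")},
        ],
    }
    r = requests.post(f"{EVALUATOR_ENDPOINT}/v1/chat/completions", json=payload, timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```
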
**Choosing an Evaluator Model:**

- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (e.g., 0.3) provides more consistent scoring

#### 3. Test Multiple Models (Batch Mode)

Test multiple models in one run by specifying comma-separated model names:

```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch test
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```

The script will automatically test each model sequentially and save individual results.
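
Conceptually, batch mode is just a sequential loop over the parsed model list; the sketch below is an illustrative stand-in (not code from `ai_eval.py`) that shells out once per model:

```python
import os
import subprocess

# Split the comma-separated MUT_MODEL list and test each model in turn.
models = [m.strip() for m in os.getenv("MUT_MODEL", "").split(",") if m.strip()]
for model in models:
    # Each run saves its own results file under the output directory.
    subprocess.run(["python", "ai_eval.py", "--model", model], check=True)
```
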
#### 4. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py --category "IT Forensics - File Systems"
```

#### 5. Analyze Results

See the [Analyzing Results](#analyzing-results) section below.

## Analyzing Results

### Interactive Web Dashboard (Recommended)

Launch the comprehensive web interface for visual analysis:

```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web

# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```

**Features:**

- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation

See [WEB_INTERFACE.md](WEB_INTERFACE.md) for detailed documentation.

### Command-Line Analysis

```bash
# Compare all models
python analyze_results.py --compare

# Detailed report for specific model
```

## Project Structure

```
├── ai_eval.py             # Main testing script
├── analyze_results.py     # Results analysis and comparison
├── test_suite.yaml        # Test definitions
├── .env.example           # Configuration template
├── results/               # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── ...
└── README.md
```

## Configuration Reference

### Environment Variables (.env file)

All configuration can be set via the `.env` file or via command-line arguments. Command-line arguments override `.env` values.
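
Conceptually, the precedence works like the following sketch, which assumes `python-dotenv` (installed in the manual step above) and `argparse`; the variable and flag names match the tables below, but the actual wiring inside `ai_eval.py` may differ:

```python
import argparse
import os

from dotenv import load_dotenv

# Reads .env into the process environment; real environment variables win.
load_dotenv()

parser = argparse.ArgumentParser()
# The .env value serves as the default; an explicit flag overrides it.
parser.add_argument("--model", default=os.getenv("MUT_MODEL"))
parser.add_argument("--endpoint", default=os.getenv("MUT_ENDPOINT", "http://localhost:11434"))
args = parser.parse_args()
print(f"Testing {args.model} at {args.endpoint}")
```
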
#### Model Under Test (MUT)

| Variable | Description | Example |
| --- | --- | --- |
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |

#### Evaluator Configuration (for Non-Interactive Mode)

| Variable | Description | Example |
| --- | --- | --- |
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |

#### Test Configuration

| Variable | Description | Example |
| --- | --- | --- |
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |

### Command-Line Arguments

All environment variables have corresponding command-line flags:

```bash
python ai_eval.py --help

Options:
  --endpoint ENDPOINT             Model under test endpoint
  --api-key API_KEY               Model under test API key
  --model MODEL                   Model name to test
  --test-suite FILE               Test suite YAML file
  --output-dir DIR                Output directory
  --category CATEGORY             Filter by category
  --dry-run                       Test endpoint connectivity and exit
  --non-interactive               Enable automated evaluation
  --evaluator-endpoint ENDPOINT   Evaluator API endpoint
  --evaluator-api-key KEY         Evaluator API key
  --evaluator-model MODEL         Evaluator model name
  --evaluator-temperature TEMP    Evaluator temperature
```

## Advanced Usage

### Custom Test Suite

Edit `test_suite.yaml` to add your own tests:

```yaml
expected_difficulty: "medium"  # medium, hard, very_hard
```
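
Purely as an illustration (every field name below except `expected_difficulty` is hypothetical), a new entry could be generated with PyYAML, which is already in the requirements:

```python
import yaml  # PyYAML

# Hypothetical test entry: only `expected_difficulty` is confirmed by this README.
custom_test = {
    "name": "ntfs_timestamps",
    "category": "IT Forensics - File Systems",
    "prompt": "Explain how NTFS stores file timestamps.",
    "criteria": "Mentions both $STANDARD_INFORMATION and $FILE_NAME attributes.",
    "expected_difficulty": "medium",  # medium, hard, very_hard
}
print(yaml.dump([custom_test], sort_keys=False))
```
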
### Batch Testing Examples

Testing multiple models using the `.env` configuration:

```bash
# Configure .env with multiple models
cp .env.example .env
nano .env

# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch tests
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M

# Generate comparison after testing
python analyze_results.py --compare
```

For OpenAI-compatible cloud services:

```bash
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```