# AI Model Evaluation Framework
Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for testing quantized models (q4_K_M, q8, fp16) against academic and practical scenarios.
## Features
- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention
- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction
- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling
- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis
- 🌐 **Interactive Web Dashboard (New!)**
  - Visual analytics with charts and graphs
  - Advanced intelligence metrics
  - Filtering, sorting, and statistical analysis
  - Multi-dimensional performance evaluation
## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install -r requirements.txt

# or manually:
pip install pyyaml requests python-dotenv
```
### Installation

```bash
# Clone or download the files

# Copy the example environment file
cp .env.example .env

# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```
### Configuration with .env File (Recommended)

The test suite can be configured using a .env file for easier batch testing and non-interactive mode:

```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY=                  # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M

# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY=            # Optional
EVALUATOR_MODEL=qwen3:14b     # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3     # Lower = more consistent scoring

# Execution Mode
NON_INTERACTIVE=false         # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY=              # Optional: filter by category
```
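For reference, these values can be read with python-dotenv, which is listed in the prerequisites. The snippet below is only a minimal sketch of that pattern, not the actual loading code in ai_eval.py:

```python
# Minimal sketch of reading the .env values with python-dotenv.
# Variable names match .env.example; ai_eval.py's real loader may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

mut_endpoint = os.getenv("MUT_ENDPOINT", "http://localhost:11434")
mut_api_key = os.getenv("MUT_API_KEY", "")   # may stay empty for local endpoints
mut_models = [m.strip() for m in os.getenv("MUT_MODEL", "").split(",") if m.strip()]
non_interactive = os.getenv("NON_INTERACTIVE", "false").lower() == "true"

print(mut_endpoint, mut_models, non_interactive)
```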
## Basic Usage

### 0. Test Connectivity (Dry Run)

Before running the full test suite, verify that your API endpoints are reachable and properly configured:

```bash
# Test MUT endpoint connectivity
python ai_eval.py --dry-run

# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run

# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run

# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```
The dry-run mode will:
- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test evaluator endpoint if in non-interactive mode
- Exit with success/failure status
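If the dry run fails and you want to check an endpoint by hand, a quick request against the OpenAI-style `/v1/models` route usually tells you whether the server is up and the key is accepted. This is a generic sketch using requests, not part of ai_eval.py, and it assumes your server exposes that route (Ollama and LM Studio's OpenAI-compatible servers do):

```python
# Quick manual connectivity check against an OpenAI-compatible endpoint.
# Adjust the base URL and key to match your .env settings.
import requests

BASE_URL = "http://localhost:11434"   # MUT_ENDPOINT
API_KEY = ""                          # MUT_API_KEY (may be empty for local servers)

headers = {"Authorization": f"Bearer {API_KEY}"} if API_KEY else {}
resp = requests.get(f"{BASE_URL}/v1/models", headers=headers, timeout=10)
resp.raise_for_status()

# List the model IDs the server reports as available
for model in resp.json().get("data", []):
    print(model.get("id"))
```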
### 1. Interactive Mode (Manual Evaluation)

```bash
# Using .env file
python ai_eval.py

# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```
### 2. Non-Interactive Mode (Automated Evaluation)

Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.

```bash
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b

# Run the test
python ai_eval.py

# Or with command-line arguments
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b-q4_K_M \
  --non-interactive \
  --evaluator-endpoint http://localhost:11434 \
  --evaluator-model qwen3:14b
```
**How Non-Interactive Mode Works:**

- For each test, the script sends the original prompt, the model's response, and the evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation (a rough sketch of this loop follows below)
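The exact evaluator prompt and score parsing live in ai_eval.py; the sketch below only illustrates the general shape of such a call against an OpenAI-compatible chat endpoint. The endpoint path, system prompt wording, and reply format here are illustrative assumptions, not the script's actual implementation:

```python
# Illustrative only: roughly how an evaluator model can be asked to score a
# response on the 0-5 rubric over an OpenAI-compatible chat endpoint.
import requests

EVALUATOR_ENDPOINT = "http://localhost:11434"
EVALUATOR_MODEL = "qwen3:14b"

def score_response(prompt, response, criteria):
    """Send prompt, response, and criteria to the evaluator and return its verdict text."""
    eval_prompt = (
        "Score the following answer from 0 (fail) to 5 (exceptional).\n\n"
        f"Task prompt:\n{prompt}\n\nModel answer:\n{response}\n\n"
        "Evaluation criteria:\n- " + "\n- ".join(criteria) +
        "\n\nReply with 'Score: <0-5>' followed by brief notes."
    )
    r = requests.post(
        f"{EVALUATOR_ENDPOINT}/v1/chat/completions",
        json={
            "model": EVALUATOR_MODEL,
            "temperature": 0.3,  # lower temperature for more consistent scoring
            "messages": [
                {"role": "system", "content": "You are an objective test evaluator."},
                {"role": "user", "content": eval_prompt},
            ],
        },
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```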
**Choosing an Evaluator Model:**

- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring
### 3. Test Multiple Models (Batch Mode)

Test multiple models in one run by specifying comma-separated model names:

```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch test
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```

The script will automatically test each model sequentially and save individual results.
### 4. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py --category "IT Forensics - File Systems"
```
### 5. Analyze Results

See the Analyzing Results section below.

## Analyzing Results
### Interactive Web Dashboard (Recommended)
Launch the comprehensive web interface for visual analysis:
```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web

# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```

**Features:**
- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation
See WEB_INTERFACE.md for detailed documentation.
### Command-Line Analysis

```bash
# Compare all models
python analyze_results.py --compare

# Detailed report for specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```
## Scoring Rubric
All tests are evaluated on a 0-5 scale:
| Score | Category | Description |
|---|---|---|
| 0-1 | FAIL | Major errors, fails to meet basic requirements |
| 2-3 | PASS | Meets requirements with minor issues |
| 4-5 | EXCEPTIONAL | Exceeds requirements, demonstrates deep understanding |
### Evaluation Criteria

**Constraint Adherence**
- Fail: Misses more than one constraint or uses a forbidden word
- Pass: Follows all constraints, but the flow is awkward
- Exceptional: Follows all constraints with natural, fluid language

**Unit Precision (for math/forensics)**
- Fail: Errors in basic conversions
- Pass: Correct conversions, but with rounding errors
- Exceptional: Perfect precision across systems

**Reasoning Path**
- Fail: Gives only the final answer without steps
- Pass: Shows steps, but the logic contains "leaps"
- Exceptional: Transparent, logical chain of thought

**Code Safety**
- Fail: Function crashes on bad input
- Pass: Logic is correct but lacks error handling
- Exceptional: Production-ready with robust error handling
## Test Categories Overview

### General Reasoning (14 tests)
- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics
### IT Forensics (8 tests)

**File Systems**
- MFT Basic Analysis: Signature, status flags, sequence numbers
- MFT Advanced: Update sequence arrays, LSN, attribute offsets
- File Signatures: Magic number identification (JPEG, PNG, PDF, ZIP, RAR); see the sketch below
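For orientation, the magic numbers behind the file-signature tests are the standard published ones. The detector below is only an illustrative helper, not part of the test suite:

```python
# Minimal magic-number detector for the formats covered by the
# file-signature tests. Illustrative only; not part of ai_eval.py.
SIGNATURES = {
    b"\xFF\xD8\xFF": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"PK\x03\x04": "ZIP (also Office/APK containers)",
    b"Rar!\x1a\x07": "RAR",
}

def identify(header: bytes) -> str:
    """Return the file type whose magic number matches the start of header."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"

print(identify(bytes.fromhex("89504E470D0A1A0A")))  # -> PNG
```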
**Registry & Artifacts**
- Registry Hive Headers: Signature, sequence numbers, format version
- FILETIME Conversion: Windows timestamp decoding; see the conversion sketch below
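A FILETIME value counts 100-nanosecond intervals since 1601-01-01 UTC, so converting to Unix time is a fixed scale and offset. A small illustrative sketch (not taken from the test suite; the example value is arbitrary):

```python
# Convert a Windows FILETIME (100-ns intervals since 1601-01-01 UTC) to a
# UTC datetime. 11644473600 s is the gap between the Windows and Unix epochs.
from datetime import datetime, timezone

EPOCH_DIFF_SECONDS = 11_644_473_600

def filetime_to_datetime(filetime: int) -> datetime:
    unix_seconds = filetime / 10_000_000 - EPOCH_DIFF_SECONDS
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

print(filetime_to_datetime(0x01D7C5F0A3B2C4D6))  # arbitrary example value
```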
**Memory & Network**
- Memory Artifacts: HTTP request extraction from dumps
- TCP Headers: Port, sequence, flags, window size analysis

**Timeline Analysis**
- Event Reconstruction: Log correlation, attack narrative building
### Multi-turn Conversations (3 tests)
- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)
## File Structure

```
.
├── ai_eval.py             # Main testing script
├── analyze_results.py     # Results analysis and comparison
├── test_suite.yaml        # Test definitions
├── .env.example           # Configuration template
├── results/               # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```
## Configuration Reference

### Environment Variables (.env file)

All configuration can be set via .env file or command-line arguments. Command-line arguments override .env values.

**Model Under Test (MUT)**

| Variable | Description | Example |
|---|---|---|
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |
**Evaluator Configuration (for Non-Interactive Mode)**

| Variable | Description | Example |
|---|---|---|
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |
**Test Configuration**

| Variable | Description | Example |
|---|---|---|
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |
### Command-Line Arguments

All environment variables have corresponding command-line flags:

```bash
python ai_eval.py --help

Options:
  --endpoint ENDPOINT             Model under test endpoint
  --api-key API_KEY               Model under test API key
  --model MODEL                   Model name to test
  --test-suite FILE               Test suite YAML file
  --output-dir DIR                Output directory
  --category CATEGORY             Filter by category
  --non-interactive               Enable automated evaluation
  --evaluator-endpoint ENDPOINT   Evaluator API endpoint
  --evaluator-api-key KEY         Evaluator API key
  --evaluator-model MODEL         Evaluator model name
  --evaluator-temperature TEMP    Evaluator temperature
```
## Advanced Usage

### Custom Test Suite

Edit test_suite.yaml to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn"  # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium"  # medium, hard, very_hard
```
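To check that a new entry parses the way you expect, you can load the file with pyyaml from the prerequisites. The field names below follow the example above; ai_eval.py's own loader may structure or validate the suite differently:

```python
# Sanity-check custom test_suite.yaml entries with pyyaml. Assumes the
# top-level structure shown in the example above (a list of categories).
import yaml

with open("test_suite.yaml", "r", encoding="utf-8") as f:
    suite = yaml.safe_load(f)

for category in suite:
    for test in category.get("tests", []):
        missing = [k for k in ("id", "name", "type", "prompt") if k not in test]
        if missing:
            print(f"{category['category']}/{test.get('id', '?')}: missing {missing}")
        else:
            print(f"OK: {category['category']} / {test['id']} ({test['type']})")
```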
### Batch Testing Examples

Testing multiple models using the .env configuration:

```bash
# Configure .env with multiple models
cp .env.example .env
nano .env

# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch tests
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M

# Generate comparison after testing
python analyze_results.py --compare
```
### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```