# AI Model Evaluation Framework

Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for testing quantized models (q4_K_M, q8_0, fp16) against academic and practical scenarios.
## Features

- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention

- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction

- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling

- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis

- **🌐 Interactive Web Dashboard** (New!)
  - Visual analytics with charts and graphs
  - Advanced intelligence metrics
  - Filtering, sorting, and statistical analysis
  - Multi-dimensional performance evaluation
## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install -r requirements.txt

# or manually:
pip install pyyaml requests python-dotenv
```
### Installation

```bash
# Clone or download the files

# Copy the example environment file
cp .env.example .env

# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```
### Configuration with .env File (Recommended)

The test suite can be configured using a `.env` file for easier batch testing and non-interactive mode:

```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY=                  # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M

# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY=            # Optional
EVALUATOR_MODEL=qwen3:14b     # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3     # Lower = more consistent scoring

# Execution Mode
NON_INTERACTIVE=false         # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY=              # Optional: filter by category
```
### Basic Usage

#### 0. Test Connectivity (Dry Run)

Before running the full test suite, verify that your API endpoints are reachable and properly configured:

```bash
# Test MUT endpoint connectivity
python ai_eval.py --dry-run

# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run

# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run

# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```

The dry-run mode will:

- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test the evaluator endpoint if in non-interactive mode
- Exit with a success/failure status
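Under the hood, a dry run boils down to a couple of authenticated requests against the configured endpoints. If you want to reproduce the connectivity check by hand, here is a minimal sketch (not the script's actual code; the endpoint and key come from the same environment variables, and `/v1/models` is the standard OpenAI-compatible listing route):

```python
import os
import requests

ENDPOINT = os.getenv("MUT_ENDPOINT", "http://localhost:11434")
API_KEY = os.getenv("MUT_API_KEY", "")

# Optional bearer token for non-local endpoints
headers = {"Authorization": f"Bearer {API_KEY}"} if API_KEY else {}

# List the models the endpoint exposes (OpenAI-compatible route)
resp = requests.get(f"{ENDPOINT}/v1/models", headers=headers, timeout=10)
resp.raise_for_status()

available = [m["id"] for m in resp.json().get("data", [])]
print("Endpoint reachable, models available:", available)
```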
#### 1. Interactive Mode (Manual Evaluation)

```bash
# Using .env file
python ai_eval.py

# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```
#### 2. Non-Interactive Mode (Automated Evaluation)

Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.

```bash
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b

# Run the test
python ai_eval.py

# Or with command-line arguments
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b-q4_K_M \
  --non-interactive \
  --evaluator-endpoint http://localhost:11434 \
  --evaluator-model qwen3:14b
```
**How Non-Interactive Mode Works:**

- For each test, the script sends the original prompt, model response, and evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation
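The exact evaluator prompt lives in `ai_eval.py`; conceptually, the scoring call is a single chat completion, roughly like the sketch below (wording, JSON shape, and the function name are illustrative assumptions, not the script's literal implementation):

```python
import requests

def score_response(prompt, response, criteria, endpoint, model,
                   api_key="", temperature=0.3):
    """Ask an evaluator model for a 0-5 score plus short notes (illustrative sketch)."""
    system = (
        "You are an impartial evaluator. Score the candidate response from 0 to 5 "
        "against the given criteria and reply as JSON: {\"score\": <int>, \"notes\": \"...\"}."
    )
    user = (
        f"Prompt:\n{prompt}\n\nCandidate response:\n{response}\n\n"
        "Evaluation criteria:\n- " + "\n- ".join(criteria)
    )
    payload = {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    r = requests.post(f"{endpoint}/v1/chat/completions",
                      json=payload, headers=headers, timeout=120)
    r.raise_for_status()
    # The evaluator's JSON reply is then parsed into a score and notes
    return r.json()["choices"][0]["message"]["content"]
```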
**Choosing an Evaluator Model:**

- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring
#### 3. Test Multiple Models (Batch Mode)

Test multiple models in one run by specifying comma-separated model names:

```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch test
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```

The script will automatically test each model sequentially and save individual results.
#### 4. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py --category "IT Forensics - File Systems"
```
#### 5. Analyze Results

```bash
# Compare all models (more options in the Analyzing Results section below)
python analyze_results.py --compare
```

## Analyzing Results
### Interactive Web Dashboard (Recommended)

Launch the comprehensive web interface for visual analysis:

```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web

# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```

**Features:**

- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation

See [WEB_INTERFACE.md](WEB_INTERFACE.md) for detailed documentation.
### Command-Line Analysis

```bash
# Compare all models
python analyze_results.py --compare

# Detailed report for specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```
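The CSV is intended for further analysis in whatever tooling you prefer; for example, a quick aggregate in pandas (pandas is not in `requirements.txt`, and the column names below are assumptions — check the export header first):

```python
import pandas as pd

# Load the comparison export produced above
df = pd.read_csv("comparison.csv")
print(df.head())

# Hypothetical columns; adjust to whatever the export actually contains
if {"model", "category", "score"}.issubset(df.columns):
    print(df.groupby(["model", "category"])["score"].mean().unstack())
```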
## Scoring Rubric

All tests are evaluated on a 0-5 scale:

| Score | Category | Description |
|-------|----------|-------------|
| 0-1 | **FAIL** | Major errors, fails to meet basic requirements |
| 2-3 | **PASS** | Meets requirements with minor issues |
| 4-5 | **EXCEPTIONAL** | Exceeds requirements, demonstrates deep understanding |
### Evaluation Criteria

#### Constraint Adherence

- Fail: Misses more than one constraint or uses a forbidden word
- Pass: Follows all constraints, but the flow is awkward
- Exceptional: Follows all constraints with natural, fluid language

#### Unit Precision (for math/forensics)

- Fail: Errors in basic conversion
- Pass: Correct conversions, but with rounding errors
- Exceptional: Perfect precision across systems

#### Reasoning Path

- Fail: Gives only the final answer without steps
- Pass: Shows steps, but the logic contains "leaps"
- Exceptional: Transparent, logical chain of thought

#### Code Safety

- Fail: Function crashes on bad input
- Pass: Logic is correct but lacks error handling
- Exceptional: Production-ready with robust error handling
## Test Categories Overview

### General Reasoning (14 tests)

- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics
### IT Forensics (8 tests)

#### File Systems

- **MFT Basic Analysis**: Signature, status flags, sequence numbers
- **MFT Advanced**: Update sequence arrays, LSN, attribute offsets
- **File Signatures**: Magic number identification (JPEG, PNG, PDF, ZIP, RAR)
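For reference when reviewing answers to the signature test, these are the standard magic numbers involved; a tiny lookup sketch:

```python
# Well-known file signatures (magic numbers) at offset 0
MAGIC = {
    b"\xFF\xD8\xFF": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF": "PDF",
    b"PK\x03\x04": "ZIP",
    b"Rar!\x1a\x07": "RAR",
}

def identify(header: bytes) -> str:
    """Return the first matching signature for a raw header, or 'unknown'."""
    for magic, name in MAGIC.items():
        if header.startswith(magic):
            return name
    return "unknown"

print(identify(bytes.fromhex("89504E470D0A1A0A")))  # -> PNG
```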
#### Registry & Artifacts

- **Registry Hive Headers**: Signature, sequence numbers, format version
- **FILETIME Conversion**: Windows timestamp decoding
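Background for the FILETIME test: a Windows FILETIME counts 100-nanosecond intervals since 1601-01-01 UTC, so decoding is one division and one offset. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

# FILETIME epoch: 1601-01-01 00:00:00 UTC, counted in 100-ns ticks
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a 64-bit FILETIME value to an aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)

# The well-known constant for the Unix epoch in FILETIME ticks
print(filetime_to_datetime(116444736000000000))  # -> 1970-01-01 00:00:00+00:00
```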
#### Memory & Network

- **Memory Artifacts**: HTTP request extraction from dumps
- **TCP Headers**: Port, sequence, flags, window size analysis
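The TCP test asks for exactly the fields that fall out of parsing the fixed 20-byte header; for reference, a small sketch (the example bytes are made up):

```python
import struct

def parse_tcp_header(raw: bytes) -> dict:
    """Parse the fixed 20-byte portion of a TCP header (network byte order)."""
    src, dst, seq, ack, off_flags, window, checksum, urg = struct.unpack("!HHIIHHHH", raw[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "data_offset": (off_flags >> 12) * 4,  # header length in bytes
        "flags": off_flags & 0x01FF,           # NS..FIN bits
        "window": window,
    }

# Example: SYN from an ephemeral port to port 80 (made-up values)
hdr = struct.pack("!HHIIHHHH", 49152, 80, 1000, 0, (5 << 12) | 0x002, 64240, 0, 0)
print(parse_tcp_header(hdr))
```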
#### Timeline Analysis

- **Event Reconstruction**: Log correlation, attack narrative building
### Multi-turn Conversations (3 tests)

- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)
## File Structure

```bash
.
├── ai_eval.py              # Main testing script
├── analyze_results.py      # Results analysis and comparison
├── test_suite.yaml         # Test definitions
├── .env.example            # Configuration template
├── results/                # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```
## Configuration Reference

### Environment Variables (.env file)

All configuration can be set via `.env` file or command-line arguments. Command-line arguments override `.env` values.

#### Model Under Test (MUT)

| Variable | Description | Example |
| --- | --- | --- |
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |
#### Evaluator Configuration (for Non-Interactive Mode)

| Variable | Description | Example |
| --- | --- | --- |
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |
#### Test Configuration

| Variable | Description | Example |
| --- | --- | --- |
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |
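The precedence noted above (command-line flag over `.env` value over built-in default) is the usual python-dotenv + argparse pattern; a simplified sketch of that pattern, not the script's literal code:

```python
import argparse
import os
from dotenv import load_dotenv

load_dotenv()  # pull .env values into the process environment

parser = argparse.ArgumentParser()
# The argparse default falls back to the environment variable, which falls back to a constant
parser.add_argument("--endpoint", default=os.getenv("MUT_ENDPOINT", "http://localhost:11434"))
parser.add_argument("--model", default=os.getenv("MUT_MODEL", "qwen3:4b-q4_K_M"))
args = parser.parse_args()

print(f"Testing {args.model} at {args.endpoint}")
```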
### Command-Line Arguments

All environment variables have corresponding command-line flags:

```bash
python ai_eval.py --help

Options:
  --endpoint ENDPOINT              Model under test endpoint
  --api-key API_KEY                Model under test API key
  --model MODEL                    Model name to test
  --test-suite FILE                Test suite YAML file
  --output-dir DIR                 Output directory
  --category CATEGORY              Filter by category
  --non-interactive                Enable automated evaluation
  --evaluator-endpoint ENDPOINT    Evaluator API endpoint
  --evaluator-api-key KEY          Evaluator API key
  --evaluator-model MODEL          Evaluator model name
  --evaluator-temperature TEMP     Evaluator temperature
```
## Advanced Usage

### Custom Test Suite

Edit `test_suite.yaml` to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn"  # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium"  # medium, hard, very_hard
```
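Before a long run it can pay to sanity-check a hand-edited suite. A small sketch using PyYAML, assuming the file's top level is the category list shown above (required keys taken from the example):

```python
import yaml

# Keys every test entry in the example above carries
REQUIRED = {"id", "name", "type", "evaluation_criteria", "expected_difficulty"}

with open("test_suite.yaml") as fh:
    suite = yaml.safe_load(fh)

for category in suite:
    for test in category.get("tests", []):
        missing = REQUIRED - test.keys()
        if missing:
            print(f"{test.get('id', '?')}: missing {sorted(missing)}")
        if test.get("type") == "single_turn" and "prompt" not in test:
            print(f"{test.get('id', '?')}: single_turn test without a prompt")
```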
### Batch Testing Examples

Testing multiple models using the `.env` configuration:

```bash
# Configure .env with multiple models
cp .env.example .env
nano .env

# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch tests
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M

# Generate comparison after testing
python analyze_results.py --compare
```
### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```
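To confirm a cloud key works before starting a full run, a one-off smoke test against the OpenAI-compatible chat-completions route can help (all values below are placeholders):

```python
import os
import requests

ENDPOINT = os.getenv("MUT_ENDPOINT", "https://api.service.com")
API_KEY = os.getenv("MUT_API_KEY", "your-api-key")
MODEL = os.getenv("MUT_MODEL", "model-name")

# Single short completion to verify the endpoint, key, and model name together
resp = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": MODEL, "messages": [{"role": "user", "content": "Reply with OK."}]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```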