# AI Model Evaluation Framework
Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for testing quantized models (q4_K_M, q8, fp16) against academic and practical scenarios.
## Features
- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention
- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction
- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling
- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis
- 🌐 **Interactive Web Dashboard (New!)**
  - Visual analytics with charts and graphs
  - Advanced intelligence metrics
  - Filtering, sorting, and statistical analysis
  - Multi-dimensional performance evaluation
## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install -r requirements.txt

# or manually:
pip install pyyaml requests python-dotenv
```
### Installation

```bash
# Clone or download the files

# Copy the example environment file
cp .env.example .env

# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```
### Configuration with .env File (Recommended)

The test suite can be configured using a .env file for easier batch testing and non-interactive mode:

```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY=                  # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M

# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY=            # Optional
EVALUATOR_MODEL=qwen3:14b     # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3     # Lower = more consistent scoring

# Execution Mode
NON_INTERACTIVE=false         # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY=              # Optional: filter by category
```
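For reference, these values can be read with python-dotenv, which is listed in the prerequisites. The snippet below is only a minimal sketch of that pattern, not the actual loading code in ai_eval.py:

```python
# Minimal sketch of reading the .env values with python-dotenv.
# Variable names match .env.example; ai_eval.py's real loader may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

mut_endpoint = os.getenv("MUT_ENDPOINT", "http://localhost:11434")
mut_api_key = os.getenv("MUT_API_KEY", "")   # may stay empty for local endpoints
mut_models = [m.strip() for m in os.getenv("MUT_MODEL", "").split(",") if m.strip()]
non_interactive = os.getenv("NON_INTERACTIVE", "false").lower() == "true"

print(mut_endpoint, mut_models, non_interactive)
```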
## Basic Usage

### 0. Test Connectivity (Dry Run)

Before running the full test suite, verify that your API endpoints are reachable and properly configured:

```bash
# Test MUT endpoint connectivity
python ai_eval.py --dry-run

# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run

# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run

# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```
The dry-run mode will:
- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test evaluator endpoint if in non-interactive mode
- Exit with success/failure status
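If the dry run fails and you want to check an endpoint by hand, a quick request against the OpenAI-style `/v1/models` route usually tells you whether the server is up and the key is accepted. This is a generic sketch using requests, not part of ai_eval.py, and it assumes your server exposes that route (Ollama and LM Studio's OpenAI-compatible servers do):

```python
# Quick manual connectivity check against an OpenAI-compatible endpoint.
# Adjust the base URL and key to match your .env settings.
import requests

BASE_URL = "http://localhost:11434"   # MUT_ENDPOINT
API_KEY = ""                          # MUT_API_KEY (may be empty for local servers)

headers = {"Authorization": f"Bearer {API_KEY}"} if API_KEY else {}
resp = requests.get(f"{BASE_URL}/v1/models", headers=headers, timeout=10)
resp.raise_for_status()

# List the model IDs the server reports as available
for model in resp.json().get("data", []):
    print(model.get("id"))
```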
### 1. Interactive Mode (Manual Evaluation)

```bash
# Using .env file
python ai_eval.py

# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```
### 2. Non-Interactive Mode (Automated Evaluation)

Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.

```bash
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b

# Run the test
python ai_eval.py

# Or with command-line arguments
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b-q4_K_M \
  --non-interactive \
  --evaluator-endpoint http://localhost:11434 \
  --evaluator-model qwen3:14b
```
**How Non-Interactive Mode Works:**

- For each test, the script sends the original prompt, the model's response, and the evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation (a rough sketch of this loop follows below)
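The exact evaluator prompt and score parsing live in ai_eval.py; the sketch below only illustrates the general shape of such a call against an OpenAI-compatible chat endpoint. The endpoint path, system prompt wording, and reply format here are illustrative assumptions, not the script's actual implementation:

```python
# Illustrative only: roughly how an evaluator model can be asked to score a
# response on the 0-5 rubric over an OpenAI-compatible chat endpoint.
import requests

EVALUATOR_ENDPOINT = "http://localhost:11434"
EVALUATOR_MODEL = "qwen3:14b"

def score_response(prompt, response, criteria):
    """Send prompt, response, and criteria to the evaluator and return its verdict text."""
    eval_prompt = (
        "Score the following answer from 0 (fail) to 5 (exceptional).\n\n"
        f"Task prompt:\n{prompt}\n\nModel answer:\n{response}\n\n"
        "Evaluation criteria:\n- " + "\n- ".join(criteria) +
        "\n\nReply with 'Score: <0-5>' followed by brief notes."
    )
    r = requests.post(
        f"{EVALUATOR_ENDPOINT}/v1/chat/completions",
        json={
            "model": EVALUATOR_MODEL,
            "temperature": 0.3,  # lower temperature for more consistent scoring
            "messages": [
                {"role": "system", "content": "You are an objective test evaluator."},
                {"role": "user", "content": eval_prompt},
            ],
        },
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```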
**Choosing an Evaluator Model:**

- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring
### 3. Test Multiple Models (Batch Mode)

Test multiple models in one run by specifying comma-separated model names:

```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch test
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```

The script will automatically test each model sequentially and save individual results.
### 4. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py --category "IT Forensics - File Systems"
```
### 5. Analyze Results

See the Analyzing Results section below.

## Analyzing Results
### Interactive Web Dashboard (Recommended)
Launch the comprehensive web interface for visual analysis:
```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web

# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```

**Features:**
- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation
See WEB_INTERFACE.md for detailed documentation.
### Command-Line Analysis

```bash
# Compare all models
python analyze_results.py --compare

# Detailed report for specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```
## Scoring Rubric
All tests are evaluated on a 0-5 scale:
| Score | Category | Description |
|---|---|---|
| 0-1 | FAIL | Major errors, fails to meet basic requirements |
| 2-3 | PASS | Meets requirements with minor issues |
| 4-5 | EXCEPTIONAL | Exceeds requirements, demonstrates deep understanding |
### Evaluation Criteria

**Constraint Adherence**
- Fail: Misses more than one constraint or uses a forbidden word
- Pass: Follows all constraints, but the flow is awkward
- Exceptional: Follows all constraints with natural, fluid language

**Unit Precision (for math/forensics)**
- Fail: Errors in basic conversions
- Pass: Correct conversions, but with rounding errors
- Exceptional: Perfect precision across systems

**Reasoning Path**
- Fail: Gives only the final answer without steps
- Pass: Shows steps, but the logic contains "leaps"
- Exceptional: Transparent, logical chain of thought

**Code Safety**
- Fail: Function crashes on bad input
- Pass: Logic is correct but lacks error handling
- Exceptional: Production-ready with robust error handling
## Test Categories Overview

### General Reasoning (14 tests)
- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics
### IT Forensics (8 tests)

**File Systems**
- MFT Basic Analysis: Signature, status flags, sequence numbers
- MFT Advanced: Update sequence arrays, LSN, attribute offsets
- File Signatures: Magic number identification (JPEG, PNG, PDF, ZIP, RAR); see the sketch below
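For orientation, the magic numbers behind the file-signature tests are the standard published ones. The detector below is only an illustrative helper, not part of the test suite:

```python
# Minimal magic-number detector for the formats covered by the
# file-signature tests. Illustrative only; not part of ai_eval.py.
SIGNATURES = {
    b"\xFF\xD8\xFF": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"PK\x03\x04": "ZIP (also Office/APK containers)",
    b"Rar!\x1a\x07": "RAR",
}

def identify(header: bytes) -> str:
    """Return the file type whose magic number matches the start of header."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"

print(identify(bytes.fromhex("89504E470D0A1A0A")))  # -> PNG
```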
**Registry & Artifacts**
- Registry Hive Headers: Signature, sequence numbers, format version
- FILETIME Conversion: Windows timestamp decoding; see the conversion sketch below
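A FILETIME value counts 100-nanosecond intervals since 1601-01-01 UTC, so converting to Unix time is a fixed scale and offset. A small illustrative sketch (not taken from the test suite; the example value is arbitrary):

```python
# Convert a Windows FILETIME (100-ns intervals since 1601-01-01 UTC) to a
# UTC datetime. 11644473600 s is the gap between the Windows and Unix epochs.
from datetime import datetime, timezone

EPOCH_DIFF_SECONDS = 11_644_473_600

def filetime_to_datetime(filetime: int) -> datetime:
    unix_seconds = filetime / 10_000_000 - EPOCH_DIFF_SECONDS
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

print(filetime_to_datetime(0x01D7C5F0A3B2C4D6))  # arbitrary example value
```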
**Memory & Network**
- Memory Artifacts: HTTP request extraction from dumps
- TCP Headers: Port, sequence, flags, window size analysis

**Timeline Analysis**
- Event Reconstruction: Log correlation, attack narrative building
### Multi-turn Conversations (3 tests)
- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)
## File Structure

```
.
├── ai_eval.py             # Main testing script
├── analyze_results.py     # Results analysis and comparison
├── test_suite.yaml        # Test definitions
├── .env.example           # Configuration template
├── results/               # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```
## Configuration Reference

### Environment Variables (.env file)

All configuration can be set via .env file or command-line arguments. Command-line arguments override .env values.

**Model Under Test (MUT)**

| Variable | Description | Example |
|---|---|---|
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |
**Evaluator Configuration (for Non-Interactive Mode)**

| Variable | Description | Example |
|---|---|---|
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |
**Test Configuration**

| Variable | Description | Example |
|---|---|---|
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |
### Command-Line Arguments

All environment variables have corresponding command-line flags:

```bash
python ai_eval.py --help

Options:
  --endpoint ENDPOINT             Model under test endpoint
  --api-key API_KEY               Model under test API key
  --model MODEL                   Model name to test
  --test-suite FILE               Test suite YAML file
  --output-dir DIR                Output directory
  --category CATEGORY             Filter by category
  --non-interactive               Enable automated evaluation
  --evaluator-endpoint ENDPOINT   Evaluator API endpoint
  --evaluator-api-key KEY         Evaluator API key
  --evaluator-model MODEL         Evaluator model name
  --evaluator-temperature TEMP    Evaluator temperature
```
## Advanced Usage

### Custom Test Suite

Edit test_suite.yaml to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn"  # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium"  # medium, hard, very_hard
```
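To check that a new entry parses the way you expect, you can load the file with pyyaml from the prerequisites. The field names below follow the example above; ai_eval.py's own loader may structure or validate the suite differently:

```python
# Sanity-check custom test_suite.yaml entries with pyyaml. Assumes the
# top-level structure shown in the example above (a list of categories).
import yaml

with open("test_suite.yaml", "r", encoding="utf-8") as f:
    suite = yaml.safe_load(f)

for category in suite:
    for test in category.get("tests", []):
        missing = [k for k in ("id", "name", "type", "prompt") if k not in test]
        if missing:
            print(f"{category['category']}/{test.get('id', '?')}: missing {missing}")
        else:
            print(f"OK: {category['category']} / {test['id']} ({test['type']})")
```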
### Batch Testing Examples

Testing multiple models using the .env configuration:

```bash
# Configure .env with multiple models
cp .env.example .env
nano .env

# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch tests
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M

# Generate comparison after testing
python analyze_results.py --compare
```
### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```