# AI Model Evaluation Framework

Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for testing quantized models (q4_K_M, q8, fp16) against academic and practical scenarios.

## Features

- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention
- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction
- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling
- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis
- **🌐 Interactive Web Dashboard** (New!)
  - Visual analytics with charts and graphs
  - Advanced intelligence metrics
  - Filtering, sorting, and statistical analysis
  - Multi-dimensional performance evaluation

## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install -r requirements.txt

# or manually:
pip install pyyaml requests python-dotenv
```

### Installation

```bash
# Clone or download the files

# Copy the example environment file
cp .env.example .env

# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```

### Configuration with .env File (Recommended)

The test suite can be configured using a `.env` file for easier batch testing and non-interactive mode:

```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY=                  # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M

# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY=            # Optional
EVALUATOR_MODEL=qwen3:14b     # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3     # Lower = more consistent scoring

# Execution Mode
NON_INTERACTIVE=false         # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY=              # Optional: filter by category
```

### Basic Usage

#### 0. Test Connectivity (Dry Run)

Before running the full test suite, verify that your API endpoints are reachable and properly configured:

```bash
# Test MUT endpoint connectivity
python ai_eval.py --dry-run

# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run

# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run

# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```

The dry-run mode will:

- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test evaluator endpoint if in non-interactive mode
- Exit with success/failure status
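For reference, the kind of connectivity probe a dry run performs can be approximated in a few lines of Python. The snippet below is a minimal sketch, not the script's actual implementation; the `/v1/models` route and Bearer-token header are assumptions based on the OpenAI-compatible convention:

```python
import os
import sys

import requests


def check_endpoint(endpoint: str, api_key: str = "", model: str = "") -> bool:
    """Probe an OpenAI-compatible endpoint and optionally look for a model.

    Assumes the server exposes the conventional GET /v1/models discovery
    route; adjust the path if your server differs.
    """
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    try:
        resp = requests.get(f"{endpoint.rstrip('/')}/v1/models",
                            headers=headers, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"[FAIL] {endpoint}: {exc}")
        return False

    listed = [m.get("id", "") for m in resp.json().get("data", [])]
    if model and model not in listed:
        print(f"[WARN] {endpoint} reachable, but '{model}' is not in the model list")
    else:
        print(f"[OK] {endpoint} reachable ({len(listed)} models listed)")
    return True


if __name__ == "__main__":
    ok = check_endpoint(os.getenv("MUT_ENDPOINT", "http://localhost:11434"),
                        os.getenv("MUT_API_KEY", ""),
                        os.getenv("MUT_MODEL", ""))
    sys.exit(0 if ok else 1)
```

Running it with the same `.env` values exported should roughly mirror what `--dry-run` reports.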
#### 1. Interactive Mode (Manual Evaluation)

```bash
# Using .env file
python ai_eval.py

# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```

#### 2. Non-Interactive Mode (Automated Evaluation)

Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.

```bash
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b

# Run the test
python ai_eval.py

# Or with command-line arguments
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b-q4_K_M \
  --non-interactive \
  --evaluator-endpoint http://localhost:11434 \
  --evaluator-model qwen3:14b
```

**How Non-Interactive Mode Works:**

- For each test, the script sends the original prompt, model response, and evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation

A rough sketch of this evaluator call appears below, after the **Analyzing Results** overview.

**Choosing an Evaluator Model:**

- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring

#### 3. Test Multiple Models (Batch Mode)

Test multiple models in one run by specifying comma-separated model names:

```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch test
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```

The script will automatically test each model sequentially and save individual results.

#### 4. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py --category "IT Forensics - File Systems"
```

#### 5. Analyze Results

See **Analyzing Results** below for the web dashboard and command-line options.

## Analyzing Results

### Interactive Web Dashboard (Recommended)

Launch the comprehensive web interface for visual analysis:

```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web

# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```

**Features:**

- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation

See [WEB_INTERFACE.md](WEB_INTERFACE.md) for detailed documentation.
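As referenced under non-interactive mode above, automated scoring is essentially an LLM-as-judge call. The sketch below illustrates that pattern against an OpenAI-compatible chat completions endpoint; the system prompt wording, JSON score format, and fallback parsing are illustrative assumptions, not the actual prompts or schema used by `ai_eval.py`:

```python
import json
import re
from typing import List

import requests

# Illustrative system prompt; the script's actual evaluation prompt differs.
EVALUATOR_SYSTEM_PROMPT = (
    "You are a strict test evaluator. Score the candidate response against the "
    'criteria on a 0-5 scale and reply as JSON: {"score": <int>, "notes": "..."}'
)


def score_response(endpoint: str, evaluator_model: str, prompt: str,
                   response: str, criteria: List[str],
                   api_key: str = "", temperature: float = 0.3) -> dict:
    """Ask an evaluator model to score one test response (LLM-as-judge sketch)."""
    user_message = (
        f"Original prompt:\n{prompt}\n\n"
        f"Model response:\n{response}\n\n"
        "Evaluation criteria:\n- " + "\n- ".join(criteria)
    )
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "model": evaluator_model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": EVALUATOR_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }
    resp = requests.post(f"{endpoint.rstrip('/')}/v1/chat/completions",
                         headers=headers, json=payload, timeout=120)
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    try:
        return json.loads(content)            # well-behaved evaluator: pure JSON
    except json.JSONDecodeError:
        match = re.search(r"[0-5]", content)  # fallback: first 0-5 digit found
        return {"score": int(match.group()) if match else 0, "notes": content.strip()}
```

Keeping the evaluator temperature low (0.3 by default) is what makes repeated scoring runs reasonably consistent.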
### Command-Line Analysis

```bash
# Compare all models
python analyze_results.py --compare

# Detailed report for specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```

## Scoring Rubric

All tests are evaluated on a 0-5 scale:

| Score | Category | Description |
|-------|----------|-------------|
| 0-1 | **FAIL** | Major errors, fails to meet basic requirements |
| 2-3 | **PASS** | Meets requirements with minor issues |
| 4-5 | **EXCEPTIONAL** | Exceeds requirements, demonstrates deep understanding |

### Evaluation Criteria

#### Constraint Adherence
- Fail: Misses more than one constraint or uses a forbidden word
- Pass: Follows all constraints but flow is awkward
- Exceptional: Follows all constraints with natural, fluid language

#### Unit Precision (for math/forensics)
- Fail: Errors in basic conversion
- Pass: Correct conversions but with rounding errors
- Exceptional: Perfect precision across systems

#### Reasoning Path
- Fail: Gives only final answer without steps
- Pass: Shows steps but logic contains "leaps"
- Exceptional: Transparent, logical chain-of-thought

#### Code Safety
- Fail: Function crashes on bad input
- Pass: Logic correct but lacks error handling
- Exceptional: Production-ready with robust error catching

## Test Categories Overview

### General Reasoning (14 tests)
- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics

### IT Forensics (8 tests)

#### File Systems
- **MFT Basic Analysis**: Signature, status flags, sequence numbers
- **MFT Advanced**: Update sequence arrays, LSN, attribute offsets
- **File Signatures**: Magic number identification (JPEG, PNG, PDF, ZIP, RAR)

#### Registry & Artifacts
- **Registry Hive Headers**: Signature, sequence numbers, format version
- **FILETIME Conversion**: Windows timestamp decoding

#### Memory & Network
- **Memory Artifacts**: HTTP request extraction from dumps
- **TCP Headers**: Port, sequence, flags, window size analysis

#### Timeline Analysis
- **Event Reconstruction**: Log correlation, attack narrative building

### Multi-turn Conversations (3 tests)
- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)

## File Structure

```bash
.
├── ai_eval.py            # Main testing script
├── analyze_results.py    # Results analysis and comparison
├── test_suite.yaml       # Test definitions
├── .env.example          # Configuration template
├── results/              # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```

## Configuration Reference

### Environment Variables (.env file)

All configuration can be set via `.env` file or command-line arguments. Command-line arguments override `.env` values.
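As a rough illustration of that precedence, the sketch below resolves the MUT settings with `python-dotenv` and `argparse`; the actual script may resolve its configuration differently:

```python
import argparse
import os

from dotenv import load_dotenv

# Load .env values into the process environment (already-set env vars win).
load_dotenv()

parser = argparse.ArgumentParser(description="Resolve model-under-test settings")
parser.add_argument("--endpoint")
parser.add_argument("--api-key")
parser.add_argument("--model")
args = parser.parse_args()

# Command-line arguments take precedence; fall back to .env / defaults.
endpoint = args.endpoint or os.getenv("MUT_ENDPOINT", "http://localhost:11434")
api_key = args.api_key or os.getenv("MUT_API_KEY", "")
model = args.model or os.getenv("MUT_MODEL", "")

print(f"endpoint={endpoint} model={model} auth={'yes' if api_key else 'no'}")
```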
#### Model Under Test (MUT)

| Variable | Description | Example |
| --- | --- | --- |
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |

#### Evaluator Configuration (for Non-Interactive Mode)

| Variable | Description | Example |
| --- | --- | --- |
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |

#### Test Configuration

| Variable | Description | Example |
| --- | --- | --- |
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |

### Command-Line Arguments

All environment variables have corresponding command-line flags:

```bash
python ai_eval.py --help

Options:
  --endpoint ENDPOINT            Model under test endpoint
  --api-key API_KEY              Model under test API key
  --model MODEL                  Model name to test
  --test-suite FILE              Test suite YAML file
  --output-dir DIR               Output directory
  --category CATEGORY            Filter by category
  --non-interactive              Enable automated evaluation
  --evaluator-endpoint ENDPOINT  Evaluator API endpoint
  --evaluator-api-key KEY        Evaluator API key
  --evaluator-model MODEL        Evaluator model name
  --evaluator-temperature TEMP   Evaluator temperature
```

## Advanced Usage

### Custom Test Suite

Edit `test_suite.yaml` to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn"  # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium"  # medium, hard, very_hard
```

### Batch Testing Examples

Testing multiple models using the `.env` configuration:

```bash
# Configure .env with multiple models
cp .env.example .env
nano .env

# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch tests
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M

# Generate comparison after testing
python analyze_results.py --compare
```

### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```
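Finally, if you extend `test_suite.yaml` as described under Advanced Usage above, a quick parse check before a long run can catch typos early. The sketch below assumes the top-level structure and per-test keys shown in the Custom Test Suite example; the script's own schema validation (if any) may differ:

```python
import sys

import yaml

# Keys used in the example test definition above; treat this as a rough
# sanity pass, not the authoritative schema enforced by ai_eval.py.
EXPECTED_TEST_KEYS = {"id", "name", "type", "prompt", "evaluation_criteria"}


def validate_suite(path: str = "test_suite.yaml") -> bool:
    with open(path, "r", encoding="utf-8") as fh:
        suite = yaml.safe_load(fh)
    ok = True
    for category in suite:
        for test in category.get("tests", []):
            missing = EXPECTED_TEST_KEYS - set(test)
            if missing:
                print(f"[WARN] {test.get('id', '<no id>')}: missing {sorted(missing)}")
                ok = False
    print("Suite parsed successfully." if ok else "Suite parsed with warnings.")
    return ok


if __name__ == "__main__":
    sys.exit(0 if validate_suite() else 1)
```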