llm-eval-forensics/README.md

# AI Model Evaluation Framework

Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for testing quantized models (q4_K_M, q8, fp16) against academic and practical scenarios.

## Features

- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention

- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction

- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling

- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis

## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install pyyaml requests
```

### Installation

```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
```

### Basic Usage

#### 1. Test a Single Model

```bash
# For Ollama (default: http://localhost:11434)
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```

#### 2. Test Multiple Models (Quantization Comparison)

```bash
# Test different quantizations of qwen3:4b
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16

# Test different model sizes
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M
```

#### 3. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b \
  --category "IT Forensics - File Systems"
```

#### 4. Analyze Results

```bash
# Compare all tested models
python analyze_results.py --compare

# Detailed report for specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```

## Scoring Rubric

All tests are evaluated on a 0-5 scale:

| Score | Category | Description |
|-------|----------|-------------|
| 0-1 | **FAIL** | Major errors, fails to meet basic requirements |
| 2-3 | **PASS** | Meets requirements with minor issues |
| 4-5 | **EXCEPTIONAL** | Exceeds requirements, demonstrates deep understanding |

### Evaluation Criteria

#### Constraint Adherence

- Fail: Misses more than one constraint or forbidden word
- Pass: Follows all constraints but flow is awkward
- Exceptional: Follows all constraints with natural, fluid language

#### Unit Precision (for math/forensics)

- Fail: Errors in basic conversion
- Pass: Correct conversions but rounding errors
- Exceptional: Perfect precision across systems

#### Reasoning Path

- Fail: Gives only final answer without steps
- Pass: Shows steps but logic contains "leaps"
- Exceptional: Transparent, logical chain-of-thought

#### Code Safety

- Fail: Function crashes on bad input
- Pass: Logic correct but lacks error handling
- Exceptional: Production-ready with robust error catching

## Test Categories Overview

### General Reasoning (14 tests)

- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics

### IT Forensics (8 tests)

#### File Systems

- **MFT Basic Analysis**: Signature, status flags, sequence numbers
- **MFT Advanced**: Update sequence arrays, LSN, attribute offsets
- **File Signatures**: Magic number identification (JPEG, PNG, PDF, ZIP, RAR)

#### Registry & Artifacts

- **Registry Hive Headers**: Signature, sequence numbers, format version
- **FILETIME Conversion**: Windows timestamp decoding

#### Memory & Network

- **Memory Artifacts**: HTTP request extraction from dumps
- **TCP Headers**: Port, sequence, flags, window size analysis

#### Timeline Analysis

- **Event Reconstruction**: Log correlation, attack narrative building

### Multi-turn Conversations (3 tests)

- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)

## File Structure

```bash
.
├── ai_eval.py              # Main testing script
├── analyze_results.py      # Results analysis and comparison
├── test_suite.yaml         # Test definitions
├── results/                # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```

## Advanced Usage

### Custom Test Suite

Edit `test_suite.yaml` to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn"  # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium"  # medium, hard, very_hard
```

### Batch Testing Script

Create `batch_test.sh`:

```bash
#!/bin/bash

ENDPOINT="http://localhost:11434"

# Test all qwen3:4b quantizations
for quant in q4_K_M q8_0 fp16; do
    echo "Testing qwen3:4b-${quant}..."
    python ai_eval.py --endpoint $ENDPOINT --model "qwen3:4b-${quant}"
done

# Test all sizes with q4_K_M
for size in 4b 8b 14b; do
    echo "Testing qwen3:${size}-q4_K_M..."
    python ai_eval.py --endpoint $ENDPOINT --model "qwen3:${size}-q4_K_M"
done

# Generate comparison
python analyze_results.py --compare
```

### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
python ai_eval.py \
  --endpoint https://api.service.com \
  --api-key your-api-key \
  --model model-name
```