# AI Model Evaluation Framework

Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for testing quantized models (q4_K_M, q8_0, fp16) against academic and practical scenarios.
## Features

- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention
- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction
- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.; see the request sketch after this list)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling
- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis
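Under the hood, every server listed above exposes the same stateless chat endpoint, so one request shape covers all of them. The sketch below is illustrative only, not `ai_eval.py`'s actual implementation; the `chat` helper name and the 300-second timeout are this example's choices:

```python
import requests

def chat(endpoint, model, messages, api_key=None):
    """POST one chat-completion request to an OpenAI-compatible server."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    resp = requests.post(
        f"{endpoint}/v1/chat/completions",
        json={"model": model, "messages": messages},
        headers=headers,
        timeout=300,  # local models can be slow to generate
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# One turn against a local Ollama server
print(chat("http://localhost:11434", "qwen3:4b-q4_K_M",
           [{"role": "user", "content": "Name the four-byte MFT record signature."}]))
```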
## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install pyyaml requests
```

### Installation

```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
```
### Basic Usage

#### 1. Test a Single Model

```bash
# For Ollama (default: http://localhost:11434)
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with an API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```

#### 2. Test Multiple Models (Quantization Comparison)

```bash
# Test different quantizations of qwen3:4b
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16

# Test different model sizes
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M
```

#### 3. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b \
  --category "IT Forensics - File Systems"
```

#### 4. Analyze Results

```bash
# Compare all tested models
python analyze_results.py --compare

# Detailed report for a specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```
## Scoring Rubric

All tests are evaluated on a 0-5 scale:

| Score | Category | Description |
|---|---|---|
| 0-1 | FAIL | Major errors, fails to meet basic requirements |
| 2-3 | PASS | Meets requirements with minor issues |
| 4-5 | EXCEPTIONAL | Exceeds requirements, demonstrates deep understanding |
### Evaluation Criteria

**Constraint Adherence**

- Fail: misses more than one constraint or forbidden word
- Pass: follows all constraints, but flow is awkward
- Exceptional: follows all constraints with natural, fluid language

**Unit Precision (for math/forensics)** (see the FILETIME sketch after this list)

- Fail: errors in basic conversion
- Pass: correct conversions, but with rounding errors
- Exceptional: perfect precision across systems

**Reasoning Path**

- Fail: gives only the final answer without steps
- Pass: shows steps, but the logic contains "leaps"
- Exceptional: transparent, logical chain of thought

**Code Safety**

- Fail: function crashes on bad input
- Pass: logic is correct but lacks error handling
- Exceptional: production-ready with robust error handling
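The Unit Precision criterion is exercised most directly by FILETIME decoding: a Windows FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC, so the conversion is a single offset-and-scale step. A minimal reference sketch, independent of the test harness:

```python
from datetime import datetime, timedelta, timezone

def filetime_to_datetime(filetime):
    """Convert a Windows FILETIME (100-ns ticks since 1601-01-01 UTC)."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    # // 10 converts ticks to microseconds, dropping the sub-µs remainder
    return epoch + timedelta(microseconds=filetime // 10)

# 116444736000000000 ticks is exactly the Unix epoch
print(filetime_to_datetime(116444736000000000))  # 1970-01-01 00:00:00+00:00
```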
## Test Categories Overview

### General Reasoning (14 tests)

- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics

### IT Forensics (8 tests)

**File Systems**

- MFT Basic Analysis: signature, status flags, sequence numbers (see the header sketch after this list)
- MFT Advanced: update sequence arrays, LSN, attribute offsets
- File Signatures: magic number identification (JPEG, PNG, PDF, ZIP, RAR)

**Registry & Artifacts**

- Registry Hive Headers: signature, sequence numbers, format version
- FILETIME Conversion: Windows timestamp decoding

**Memory & Network**

- Memory Artifacts: HTTP request extraction from dumps
- TCP Headers: port, sequence, flags, window size analysis

**Timeline Analysis**

- Event Reconstruction: log correlation, attack narrative building
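The MFT tests hand the model a raw hex dump and ask for exactly the fields above. For reference, the fixed NTFS record-header fields decode in a few lines; this is a standalone sketch following the standard NTFS layout, not code or data from the suite itself:

```python
import struct

def parse_mft_header(record):
    """Decode the fixed header fields of an NTFS MFT record (little-endian)."""
    if record[:4] != b"FILE":
        raise ValueError("missing 'FILE' signature")
    lsn, = struct.unpack_from("<Q", record, 0x08)  # $LogFile sequence number
    # offsets 0x10-0x17: sequence number, hard links, attr offset, flags
    seq, links, attr_off, flags = struct.unpack_from("<HHHH", record, 0x10)
    return {
        "lsn": lsn,
        "sequence_number": seq,
        "hard_link_count": links,
        "first_attribute_offset": attr_off,
        "in_use": bool(flags & 0x0001),        # status flag 0x01
        "is_directory": bool(flags & 0x0002),  # status flag 0x02
    }
```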
### Multi-turn Conversations (3 tests)

- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)
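OpenAI-compatible chat endpoints are stateless, so "context retention" in these tests means resending the full message history on every turn. Building on the hypothetical `chat` helper sketched under Features (the prompts here are placeholders, not the suite's actual multi-turn prompts):

```python
# Carry context across turns by accumulating the message history.
history = []
for prompt in ["Here is a hex dump of a file header. What is at offset 0?",
               "Based on your last answer, what signature follows e_lfanew?"]:
    history.append({"role": "user", "content": prompt})
    reply = chat("http://localhost:11434", "qwen3:4b-q4_K_M", history)
    history.append({"role": "assistant", "content": reply})
```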
## File Structure

```text
.
├── ai_eval.py            # Main testing script
├── analyze_results.py    # Results analysis and comparison
├── test_suite.yaml       # Test definitions
├── results/              # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```
## Advanced Usage

### Custom Test Suite

Edit `test_suite.yaml` to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn"  # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium"  # medium, hard, very_hard
```
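If you extend the suite, a quick sanity check catches schema mistakes before a long run. The sketch below assumes the file's top level is a list of category mappings, as in the example above; `ai_eval.py`'s real loader may validate differently:

```python
import yaml  # pyyaml, already installed as a prerequisite

with open("test_suite.yaml") as f:
    suite = yaml.safe_load(f)

# Check the fields the schema above requires.
for category in suite:
    for test in category["tests"]:
        assert test["type"] in ("single_turn", "multi_turn"), test["id"]
        assert test["expected_difficulty"] in ("medium", "hard", "very_hard"), test["id"]
    print(f"{category['category']}: {len(category['tests'])} test(s) OK")
```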
### Batch Testing Script

Create `batch_test.sh`:

```bash
#!/bin/bash
ENDPOINT="http://localhost:11434"

# Test all qwen3:4b quantizations
for quant in q4_K_M q8_0 fp16; do
    echo "Testing qwen3:4b-${quant}..."
    python ai_eval.py --endpoint "$ENDPOINT" --model "qwen3:4b-${quant}"
done

# Test all sizes with q4_K_M
for size in 4b 8b 14b; do
    echo "Testing qwen3:${size}-q4_K_M..."
    python ai_eval.py --endpoint "$ENDPOINT" --model "qwen3:${size}-q4_K_M"
done

# Generate comparison
python analyze_results.py --compare
```
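Make the script executable with `chmod +x batch_test.sh`, then run `./batch_test.sh`; each run writes its results under `results/` (see File Structure above) before the final comparison step.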
### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
python ai_eval.py \
  --endpoint https://api.service.com \
  --api-key your-api-key \
  --model model-name
```
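Local servers such as Ollama and LM Studio normally accept requests without authentication, so `--api-key` is only needed when the endpoint enforces bearer-token auth.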