# AI Model Evaluation Framework

Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for comparing quantized and full-precision builds of a model (q4_K_M, q8_0, fp16) against academic and practical scenarios.

## Features

- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention

- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction

- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling

- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis

## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install pyyaml requests
```

### Installation

```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
```

### Basic Usage

#### 1. Test a Single Model

```bash
# For Ollama (default: http://localhost:11434)
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with an API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```

#### 2. Test Multiple Models (Quantization Comparison)

```bash
# Test different quantizations of qwen3:4b
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16

# Test different model sizes
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M
```

#### 3. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b \
  --category "IT Forensics - File Systems"
```

#### 4. Analyze Results

```bash
# Compare all tested models
python analyze_results.py --compare

# Detailed report for a specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```

## Scoring Rubric

All tests are evaluated on a 0-5 scale:

| Score | Category | Description |
|-------|----------|-------------|
| 0-1 | **FAIL** | Major errors, fails to meet basic requirements |
| 2-3 | **PASS** | Meets requirements with minor issues |
| 4-5 | **EXCEPTIONAL** | Exceeds requirements, demonstrates deep understanding |
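
For reference, this is how the bands partition the scale. `score_band` is a hypothetical helper for illustration only, not a function from `ai_eval.py`:

```python
def score_band(score: int) -> str:
    """Map a 0-5 rubric score to its band (hypothetical helper, not part of ai_eval.py)."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    return "FAIL" if score <= 1 else "PASS" if score <= 3 else "EXCEPTIONAL"
```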

### Evaluation Criteria

#### Constraint Adherence

- Fail: Misses more than one constraint or uses a forbidden word
- Pass: Follows all constraints, but the flow is awkward
- Exceptional: Follows all constraints with natural, fluid language

#### Unit Precision (for math/forensics)

- Fail: Errors in basic conversion
- Pass: Correct conversions, but with rounding errors
- Exceptional: Perfect precision across systems

#### Reasoning Path

- Fail: Gives only the final answer, without steps
- Pass: Shows steps, but the logic contains "leaps"
- Exceptional: Transparent, logical chain-of-thought

#### Code Safety

- Fail: Function crashes on bad input
- Pass: Logic is correct but lacks error handling
- Exceptional: Production-ready, with robust error handling

## Test Categories Overview

### General Reasoning (14 tests)

- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics

### IT Forensics (8 tests)

#### File Systems

- **MFT Basic Analysis**: Signature, status flags, sequence numbers
- **MFT Advanced**: Update sequence arrays, LSN, attribute offsets
- **File Signatures**: Magic number identification (JPEG, PNG, PDF, ZIP, RAR); see the sketch below
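
To make the File Systems tests concrete, here is a minimal sketch of the kind of parsing they probe: matching the listed magic numbers and decoding the fixed header of an NTFS MFT `FILE` record. This is illustrative only, not code from `ai_eval.py`:

```python
import struct

# Well-known magic numbers for the formats covered by the File Signatures test
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF": "PDF",
    b"PK\x03\x04": "ZIP",
    b"Rar!\x1a\x07": "RAR",
}

def identify_signature(data: bytes) -> str:
    """Identify a file type from its leading magic bytes."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return "unknown"

def parse_mft_header(record: bytes) -> dict:
    """Decode the fixed header of an NTFS MFT FILE record (little-endian)."""
    if record[:4] != b"FILE":
        raise ValueError("not a FILE record")
    usa_offset, usa_count = struct.unpack_from("<HH", record, 0x04)
    lsn = struct.unpack_from("<Q", record, 0x08)[0]
    seq, hard_links, attr_offset, flags = struct.unpack_from("<HHHH", record, 0x10)
    return {
        "usa_offset": usa_offset,            # update sequence array offset
        "usa_count": usa_count,              # update sequence array entry count
        "lsn": lsn,                          # $LogFile sequence number
        "sequence_number": seq,
        "hard_links": hard_links,
        "first_attribute_offset": attr_offset,
        "in_use": bool(flags & 0x0001),      # status flag 0x01: record in use
        "is_directory": bool(flags & 0x0002) # status flag 0x02: directory
    }
```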

#### Registry & Artifacts

- **Registry Hive Headers**: Signature, sequence numbers, format version
- **FILETIME Conversion**: Windows timestamp decoding; see the sketch below
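
A Windows FILETIME counts 100-nanosecond intervals since 1601-01-01 UTC. A minimal decoder (illustrative, not part of the framework):

```python
from datetime import datetime, timedelta, timezone

WINDOWS_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a 64-bit FILETIME (100-ns ticks since 1601-01-01 UTC) to a datetime."""
    return WINDOWS_EPOCH + timedelta(microseconds=filetime // 10)

print(filetime_to_datetime(116444736000000000))  # 1970-01-01 00:00:00+00:00 (Unix epoch)
```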

#### Memory & Network

- **Memory Artifacts**: HTTP request extraction from dumps
- **TCP Headers**: Port, sequence, flags, window size analysis; see the sketch below
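
As an illustration of what the TCP Headers test asks for, a bare-bones parse of the 20-byte fixed TCP header (again, not framework code):

```python
import struct

def parse_tcp_header(raw: bytes) -> dict:
    """Unpack the 20-byte fixed portion of a TCP header (network byte order)."""
    src, dst, seq, ack, off_flags, window, checksum, urgent = struct.unpack("!HHIIHHHH", raw[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "sequence": seq,
        "acknowledgment": ack,
        "header_length": (off_flags >> 12) * 4,  # data offset in 32-bit words -> bytes
        "flags": off_flags & 0x01FF,             # NS/CWR/ECE/URG/ACK/PSH/RST/SYN/FIN bits
        "window_size": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }
```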

#### Timeline Analysis

- **Event Reconstruction**: Log correlation, attack narrative building

### Multi-turn Conversations (3 tests)

- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)

## File Structure

```bash
.
├── ai_eval.py             # Main testing script
├── analyze_results.py     # Results analysis and comparison
├── test_suite.yaml        # Test definitions
├── results/               # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```

## Advanced Usage

### Custom Test Suite

Edit `test_suite.yaml` to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn" # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium" # medium, hard, very_hard
```
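
A quick way to sanity-check that your edited suite still parses, assuming the top-level layout shown above (a list of categories, each with a `tests` list); this snippet is illustrative and not part of the framework:

```python
import yaml

# Load the suite and count tests per the structure shown above
with open("test_suite.yaml") as f:
    suite = yaml.safe_load(f)

total = sum(len(category["tests"]) for category in suite)
print(f"{total} tests across {len(suite)} categories")
```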

### Batch Testing Script

Create `batch_test.sh`:

```bash
#!/bin/bash

ENDPOINT="http://localhost:11434"

# Test all qwen3:4b quantizations
for quant in q4_K_M q8_0 fp16; do
  echo "Testing qwen3:4b-${quant}..."
  python ai_eval.py --endpoint "$ENDPOINT" --model "qwen3:4b-${quant}"
done

# Test all sizes with q4_K_M
for size in 4b 8b 14b; do
  echo "Testing qwen3:${size}-q4_K_M..."
  python ai_eval.py --endpoint "$ENDPOINT" --model "qwen3:${size}-q4_K_M"
done

# Generate comparison
python analyze_results.py --compare
```

### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
python ai_eval.py \
  --endpoint https://api.service.com \
  --api-key your-api-key \
  --model model-name
```
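
Any endpoint works as long as it speaks the OpenAI chat-completions protocol. A minimal request of that shape, for orientation only (the endpoint, key, and model below are placeholders, and this is not the actual code in `ai_eval.py`):

```python
import requests

ENDPOINT = "http://localhost:11434"  # placeholder: your OpenAI-compatible server
API_KEY = "sk-your-key-here"         # placeholder: local servers like Ollama ignore it

# POST a single-turn chat request to the OpenAI-compatible endpoint
response = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "qwen3:4b-q4_K_M",
        "messages": [
            {"role": "user", "content": "What is the magic number of a PNG file?"},
        ],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```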