# AI Model Evaluation Framework

Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for testing quantized models (q4_K_M, q8_0, fp16) against academic and practical scenarios.
## Features

- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention
- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction
- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.; see the request sketch after this list)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling
- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis
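Under the hood, every server listed above exposes the same stateless chat endpoint, so one request shape covers all of them. The sketch below is illustrative only, not `ai_eval.py`'s actual implementation; the `chat` helper name and the 300-second timeout are this example's choices:

```python
import requests

def chat(endpoint, model, messages, api_key=None):
    """POST one chat-completion request to an OpenAI-compatible server."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    resp = requests.post(
        f"{endpoint}/v1/chat/completions",
        json={"model": model, "messages": messages},
        headers=headers,
        timeout=300,  # local models can be slow to generate
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# One turn against a local Ollama server
print(chat("http://localhost:11434", "qwen3:4b-q4_K_M",
           [{"role": "user", "content": "Name the four-byte MFT record signature."}]))
```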
## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install pyyaml requests
```

### Installation

```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
```
### Basic Usage

#### 1. Test a Single Model

```bash
# For Ollama (default: http://localhost:11434)
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with an API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```

#### 2. Test Multiple Models (Quantization Comparison)

```bash
# Test different quantizations of qwen3:4b
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16

# Test different model sizes
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M
```

#### 3. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b \
  --category "IT Forensics - File Systems"
```

#### 4. Analyze Results

```bash
# Compare all tested models
python analyze_results.py --compare

# Detailed report for a specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```
## Scoring Rubric

All tests are evaluated on a 0-5 scale:

| Score | Category | Description |
|---|---|---|
| 0-1 | FAIL | Major errors, fails to meet basic requirements |
| 2-3 | PASS | Meets requirements with minor issues |
| 4-5 | EXCEPTIONAL | Exceeds requirements, demonstrates deep understanding |
### Evaluation Criteria

**Constraint Adherence**

- Fail: misses more than one constraint or forbidden word
- Pass: follows all constraints, but flow is awkward
- Exceptional: follows all constraints with natural, fluid language

**Unit Precision (for math/forensics)** (see the FILETIME sketch after this list)

- Fail: errors in basic conversion
- Pass: correct conversions, but with rounding errors
- Exceptional: perfect precision across systems

**Reasoning Path**

- Fail: gives only the final answer without steps
- Pass: shows steps, but the logic contains "leaps"
- Exceptional: transparent, logical chain of thought

**Code Safety**

- Fail: function crashes on bad input
- Pass: logic is correct but lacks error handling
- Exceptional: production-ready with robust error handling
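The Unit Precision criterion is exercised most directly by FILETIME decoding: a Windows FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC, so the conversion is a single offset-and-scale step. A minimal reference sketch, independent of the test harness:

```python
from datetime import datetime, timedelta, timezone

def filetime_to_datetime(filetime):
    """Convert a Windows FILETIME (100-ns ticks since 1601-01-01 UTC)."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    # // 10 converts ticks to microseconds, dropping the sub-µs remainder
    return epoch + timedelta(microseconds=filetime // 10)

# 116444736000000000 ticks is exactly the Unix epoch
print(filetime_to_datetime(116444736000000000))  # 1970-01-01 00:00:00+00:00
```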
## Test Categories Overview

### General Reasoning (14 tests)

- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics

### IT Forensics (8 tests)

**File Systems**

- MFT Basic Analysis: signature, status flags, sequence numbers (see the header sketch after this list)
- MFT Advanced: update sequence arrays, LSN, attribute offsets
- File Signatures: magic number identification (JPEG, PNG, PDF, ZIP, RAR)

**Registry & Artifacts**

- Registry Hive Headers: signature, sequence numbers, format version
- FILETIME Conversion: Windows timestamp decoding

**Memory & Network**

- Memory Artifacts: HTTP request extraction from dumps
- TCP Headers: port, sequence, flags, window size analysis

**Timeline Analysis**

- Event Reconstruction: log correlation, attack narrative building
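The MFT tests hand the model a raw hex dump and ask for exactly the fields above. For reference, the fixed NTFS record-header fields decode in a few lines; this is a standalone sketch following the standard NTFS layout, not code or data from the suite itself:

```python
import struct

def parse_mft_header(record):
    """Decode the fixed header fields of an NTFS MFT record (little-endian)."""
    if record[:4] != b"FILE":
        raise ValueError("missing 'FILE' signature")
    lsn, = struct.unpack_from("<Q", record, 0x08)  # $LogFile sequence number
    # offsets 0x10-0x17: sequence number, hard links, attr offset, flags
    seq, links, attr_off, flags = struct.unpack_from("<HHHH", record, 0x10)
    return {
        "lsn": lsn,
        "sequence_number": seq,
        "hard_link_count": links,
        "first_attribute_offset": attr_off,
        "in_use": bool(flags & 0x0001),        # status flag 0x01
        "is_directory": bool(flags & 0x0002),  # status flag 0x02
    }
```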
### Multi-turn Conversations (3 tests)

- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)
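OpenAI-compatible chat endpoints are stateless, so "context retention" in these tests means resending the full message history on every turn. Building on the hypothetical `chat` helper sketched under Features (the prompts here are placeholders, not the suite's actual multi-turn prompts):

```python
# Carry context across turns by accumulating the message history.
history = []
for prompt in ["Here is a hex dump of a file header. What is at offset 0?",
               "Based on your last answer, what signature follows e_lfanew?"]:
    history.append({"role": "user", "content": prompt})
    reply = chat("http://localhost:11434", "qwen3:4b-q4_K_M", history)
    history.append({"role": "assistant", "content": reply})
```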
## File Structure

```text
.
├── ai_eval.py            # Main testing script
├── analyze_results.py    # Results analysis and comparison
├── test_suite.yaml       # Test definitions
├── results/              # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```
## Advanced Usage

### Custom Test Suite

Edit `test_suite.yaml` to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn"  # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium"  # medium, hard, very_hard
```
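If you extend the suite, a quick sanity check catches schema mistakes before a long run. The sketch below assumes the file's top level is a list of category mappings, as in the example above; `ai_eval.py`'s real loader may validate differently:

```python
import yaml  # pyyaml, already installed as a prerequisite

with open("test_suite.yaml") as f:
    suite = yaml.safe_load(f)

# Check the fields the schema above requires.
for category in suite:
    for test in category["tests"]:
        assert test["type"] in ("single_turn", "multi_turn"), test["id"]
        assert test["expected_difficulty"] in ("medium", "hard", "very_hard"), test["id"]
    print(f"{category['category']}: {len(category['tests'])} test(s) OK")
```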
### Batch Testing Script

Create `batch_test.sh`:

```bash
#!/bin/bash
ENDPOINT="http://localhost:11434"

# Test all qwen3:4b quantizations
for quant in q4_K_M q8_0 fp16; do
    echo "Testing qwen3:4b-${quant}..."
    python ai_eval.py --endpoint "$ENDPOINT" --model "qwen3:4b-${quant}"
done

# Test all sizes with q4_K_M
for size in 4b 8b 14b; do
    echo "Testing qwen3:${size}-q4_K_M..."
    python ai_eval.py --endpoint "$ENDPOINT" --model "qwen3:${size}-q4_K_M"
done

# Generate comparison
python analyze_results.py --compare
```
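Make the script executable with `chmod +x batch_test.sh`, then run `./batch_test.sh`; each run writes its results under `results/` (see File Structure above) before the final comparison step.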
### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
python ai_eval.py \
  --endpoint https://api.service.com \
  --api-key your-api-key \
  --model model-name
```
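Local servers such as Ollama and LM Studio normally accept requests without authentication, so `--api-key` is only needed when the endpoint enforces bearer-token auth.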