# AI Model Evaluation Framework Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for testing quantized models (q4_K_M, q8, fp16) against academic and practical scenarios. ## Features - **Comprehensive Test Coverage** - Logic & Reasoning - Mathematics & Calculations - Instruction Following - Creative Writing - Code Generation - Language Nuance - IT Forensics (MFT analysis, file signatures, registry, memory, network) - Multi-turn conversations with context retention - **IT Forensics Focus** - Raw hex dump analysis (Master File Table) - File signature identification - Registry hive analysis - FILETIME conversions - Memory artifact extraction - TCP/IP header analysis - Timeline reconstruction - **Automated Testing** - OpenAI-compatible API support (Ollama, LM Studio, etc.) - Interactive evaluation with scoring rubric - Progress tracking and auto-save - Multi-turn conversation handling - **Analysis & Comparison** - Cross-model comparison reports - Category-wise performance breakdown - Difficulty-based analysis - CSV export for further analysis ## Quick Start ### Prerequisites ```bash # Python 3.8+ pip install pyyaml requests ``` ### Installation ```bash # Clone or download the files # Ensure these files are in your working directory: # - ai_eval.py # - analyze_results.py # - test_suite.yaml ``` ### Basic Usage #### 1. Test a Single Model ```bash # For Ollama (default: http://localhost:11434) python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M # For other endpoints with API key python ai_eval.py \ --endpoint https://api.example.com \ --api-key sk-your-key-here \ --model your-model-name ``` #### 2. Test Multiple Models (Quantization Comparison) ```bash # Test different quantizations of qwen3:4b python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0 python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16 # Test different model sizes python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M ``` #### 3. Filter by Category ```bash # Test only IT Forensics categories python ai_eval.py \ --endpoint http://localhost:11434 \ --model qwen3:4b \ --category "IT Forensics - File Systems" ``` #### 4. Analyze Results ```bash # Compare all tested models python analyze_results.py --compare # Detailed report for specific model python analyze_results.py --detail "qwen3:4b-q4_K_M" # Export to CSV python analyze_results.py --export comparison.csv ``` ## Scoring Rubric All tests are evaluated on a 0-5 scale: | Score | Category | Description | |-------|----------|-------------| | 0-1 | **FAIL** | Major errors, fails to meet basic requirements | | 2-3 | **PASS** | Meets requirements with minor issues | | 4-5 | **EXCEPTIONAL** | Exceeds requirements, demonstrates deep understanding | ### Evaluation Criteria #### Constraint Adherence - Fail: Misses more than one constraint or forbidden word - Pass: Follows all constraints but flow is awkward - Exceptional: Follows all constraints with natural, fluid language #### Unit Precision (for math/forensics) - Fail: Errors in basic conversion - Pass: Correct conversions but rounding errors - Exceptional: Perfect precision across systems #### Reasoning Path - Fail: Gives only final answer without steps - Pass: Shows steps but logic contains "leaps" - Exceptional: Transparent, logical chain-of-thought #### Code Safety - Fail: Function crashes on bad input - Pass: Logic correct but lacks error handling - Exceptional: Production-ready with robust error catching ## Test Categories Overview ### General Reasoning (14 tests) - Logic puzzles & temporal reasoning - Multi-step mathematics - Strict instruction following - Creative writing with constraints - Code generation - Language nuance understanding - Problem-solving & logistics ### IT Forensics (8 tests) #### File Systems - **MFT Basic Analysis**: Signature, status flags, sequence numbers - **MFT Advanced**: Update sequence arrays, LSN, attribute offsets - **File Signatures**: Magic number identification (JPEG, PNG, PDF, ZIP, RAR) #### Registry & Artifacts - **Registry Hive Headers**: Signature, sequence numbers, format version - **FILETIME Conversion**: Windows timestamp decoding #### Memory & Network - **Memory Artifacts**: HTTP request extraction from dumps - **TCP Headers**: Port, sequence, flags, window size analysis #### Timeline Analysis - **Event Reconstruction**: Log correlation, attack narrative building ### Multi-turn Conversations (3 tests) - Progressive hex analysis (PE file structure) - Forensic investigation scenario - Technical depth building (NTFS ADS) ## File Structure ```bash . ├── ai_eval.py # Main testing script ├── analyze_results.py # Results analysis and comparison ├── test_suite.yaml # Test definitions ├── results/ # Auto-created results directory │ ├── qwen3_4b-q4_K_M_latest.json │ ├── qwen3_4b-q8_0_latest.json │ └── qwen3_4b-fp16_latest.json └── README.md ``` ## Advanced Usage ### Custom Test Suite Edit `test_suite.yaml` to add your own tests: ```yaml - category: "Your Category" tests: - id: "custom_01" name: "Your Test Name" type: "single_turn" # or "multi_turn" prompt: "Your test prompt here" evaluation_criteria: - "Criterion 1" - "Criterion 2" expected_difficulty: "medium" # medium, hard, very_hard ``` ### Batch Testing Script Create `batch_test.sh`: ```bash #!/bin/bash ENDPOINT="http://localhost:11434" # Test all qwen3:4b quantizations for quant in q4_K_M q8_0 fp16; do echo "Testing qwen3:4b-${quant}..." python ai_eval.py --endpoint $ENDPOINT --model "qwen3:4b-${quant}" done # Test all sizes with q4_K_M for size in 4b 8b 14b; do echo "Testing qwen3:${size}-q4_K_M..." python ai_eval.py --endpoint $ENDPOINT --model "qwen3:${size}-q4_K_M" done # Generate comparison python analyze_results.py --compare ``` ### Custom Endpoint Configuration For OpenAI-compatible cloud services: ```bash python ai_eval.py \ --endpoint https://api.service.com \ --api-key your-api-key \ --model model-name ```