# AI Model Evaluation Framework

Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for testing quantized models (q4_K_M, q8, fp16) against academic and practical scenarios.

## Features

- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention
- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction
- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling
- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis
- **🌐 Interactive Web Dashboard** (New!)
  - Visual analytics with charts and graphs
  - Advanced intelligence metrics
  - Filtering, sorting, and statistical analysis
  - Multi-dimensional performance evaluation

## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install -r requirements.txt

# or manually:
pip install pyyaml requests python-dotenv
```

### Installation

```bash
# Clone or download the files

# Copy the example environment file
cp .env.example .env

# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```

### Configuration with .env File (Recommended)

The test suite can be configured using a `.env` file for easier batch testing and non-interactive mode:

```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY=                  # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M

# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY=            # Optional
EVALUATOR_MODEL=qwen3:14b     # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3     # Lower = more consistent scoring

# Execution Mode
NON_INTERACTIVE=false         # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY=              # Optional: filter by category
```

### Basic Usage

#### 0. Test Connectivity (Dry Run)

Before running the full test suite, verify that your API endpoints are reachable and properly configured:

```bash
# Test MUT endpoint connectivity
python ai_eval.py --dry-run

# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run

# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run

# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```

The dry-run mode will:

- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test evaluator endpoint if in non-interactive mode
- Exit with success/failure status
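For reference, the kind of connectivity probe a dry run performs can be approximated in a few lines of Python. The snippet below is a minimal sketch, not the script's actual implementation; the `/v1/models` route and Bearer-token header are assumptions based on the OpenAI-compatible convention:

```python
import os
import sys

import requests


def check_endpoint(endpoint: str, api_key: str = "", model: str = "") -> bool:
    """Probe an OpenAI-compatible endpoint and optionally look for a model.

    Assumes the server exposes the conventional GET /v1/models discovery
    route; adjust the path if your server differs.
    """
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    try:
        resp = requests.get(f"{endpoint.rstrip('/')}/v1/models",
                            headers=headers, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"[FAIL] {endpoint}: {exc}")
        return False

    listed = [m.get("id", "") for m in resp.json().get("data", [])]
    if model and model not in listed:
        print(f"[WARN] {endpoint} reachable, but '{model}' is not in the model list")
    else:
        print(f"[OK] {endpoint} reachable ({len(listed)} models listed)")
    return True


if __name__ == "__main__":
    ok = check_endpoint(os.getenv("MUT_ENDPOINT", "http://localhost:11434"),
                        os.getenv("MUT_API_KEY", ""),
                        os.getenv("MUT_MODEL", ""))
    sys.exit(0 if ok else 1)
```

Running it with the same `.env` values exported should roughly mirror what `--dry-run` reports.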
#### 1. Interactive Mode (Manual Evaluation)

```bash
# Using .env file
python ai_eval.py

# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```

#### 2. Non-Interactive Mode (Automated Evaluation)

Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.

```bash
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b

# Run the test
python ai_eval.py

# Or with command-line arguments
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b-q4_K_M \
  --non-interactive \
  --evaluator-endpoint http://localhost:11434 \
  --evaluator-model qwen3:14b
```

**How Non-Interactive Mode Works:**

- For each test, the script sends the original prompt, model response, and evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation

A rough sketch of this evaluator call appears below, after the **Analyzing Results** overview.

**Choosing an Evaluator Model:**

- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring

#### 3. Test Multiple Models (Batch Mode)

Test multiple models in one run by specifying comma-separated model names:

```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch test
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```

The script will automatically test each model sequentially and save individual results.

#### 4. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py --category "IT Forensics - File Systems"
```

#### 5. Analyze Results

See **Analyzing Results** below for the web dashboard and command-line options.

## Analyzing Results

### Interactive Web Dashboard (Recommended)

Launch the comprehensive web interface for visual analysis:

```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web

# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```

**Features:**

- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation

See [WEB_INTERFACE.md](WEB_INTERFACE.md) for detailed documentation.
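As referenced under non-interactive mode above, automated scoring is essentially an LLM-as-judge call. The sketch below illustrates that pattern against an OpenAI-compatible chat completions endpoint; the system prompt wording, JSON score format, and fallback parsing are illustrative assumptions, not the actual prompts or schema used by `ai_eval.py`:

```python
import json
import re
from typing import List

import requests

# Illustrative system prompt; the script's actual evaluation prompt differs.
EVALUATOR_SYSTEM_PROMPT = (
    "You are a strict test evaluator. Score the candidate response against the "
    'criteria on a 0-5 scale and reply as JSON: {"score": <int>, "notes": "..."}'
)


def score_response(endpoint: str, evaluator_model: str, prompt: str,
                   response: str, criteria: List[str],
                   api_key: str = "", temperature: float = 0.3) -> dict:
    """Ask an evaluator model to score one test response (LLM-as-judge sketch)."""
    user_message = (
        f"Original prompt:\n{prompt}\n\n"
        f"Model response:\n{response}\n\n"
        "Evaluation criteria:\n- " + "\n- ".join(criteria)
    )
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "model": evaluator_model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": EVALUATOR_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }
    resp = requests.post(f"{endpoint.rstrip('/')}/v1/chat/completions",
                         headers=headers, json=payload, timeout=120)
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    try:
        return json.loads(content)            # well-behaved evaluator: pure JSON
    except json.JSONDecodeError:
        match = re.search(r"[0-5]", content)  # fallback: first 0-5 digit found
        return {"score": int(match.group()) if match else 0, "notes": content.strip()}
```

Keeping the evaluator temperature low (0.3 by default) is what makes repeated scoring runs reasonably consistent.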
### Command-Line Analysis

```bash
# Compare all models
python analyze_results.py --compare

# Detailed report for specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```

## Scoring Rubric

All tests are evaluated on a 0-5 scale:

| Score | Category | Description |
|-------|----------|-------------|
| 0-1 | **FAIL** | Major errors, fails to meet basic requirements |
| 2-3 | **PASS** | Meets requirements with minor issues |
| 4-5 | **EXCEPTIONAL** | Exceeds requirements, demonstrates deep understanding |

### Evaluation Criteria

#### Constraint Adherence
- Fail: Misses more than one constraint or uses a forbidden word
- Pass: Follows all constraints but flow is awkward
- Exceptional: Follows all constraints with natural, fluid language

#### Unit Precision (for math/forensics)
- Fail: Errors in basic conversion
- Pass: Correct conversions but with rounding errors
- Exceptional: Perfect precision across systems

#### Reasoning Path
- Fail: Gives only final answer without steps
- Pass: Shows steps but logic contains "leaps"
- Exceptional: Transparent, logical chain-of-thought

#### Code Safety
- Fail: Function crashes on bad input
- Pass: Logic correct but lacks error handling
- Exceptional: Production-ready with robust error catching

## Test Categories Overview

### General Reasoning (14 tests)
- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics

### IT Forensics (8 tests)

#### File Systems
- **MFT Basic Analysis**: Signature, status flags, sequence numbers
- **MFT Advanced**: Update sequence arrays, LSN, attribute offsets
- **File Signatures**: Magic number identification (JPEG, PNG, PDF, ZIP, RAR)

#### Registry & Artifacts
- **Registry Hive Headers**: Signature, sequence numbers, format version
- **FILETIME Conversion**: Windows timestamp decoding

#### Memory & Network
- **Memory Artifacts**: HTTP request extraction from dumps
- **TCP Headers**: Port, sequence, flags, window size analysis

#### Timeline Analysis
- **Event Reconstruction**: Log correlation, attack narrative building

### Multi-turn Conversations (3 tests)
- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)

## File Structure

```bash
.
├── ai_eval.py            # Main testing script
├── analyze_results.py    # Results analysis and comparison
├── test_suite.yaml       # Test definitions
├── .env.example          # Configuration template
├── results/              # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```

## Configuration Reference

### Environment Variables (.env file)

All configuration can be set via `.env` file or command-line arguments. Command-line arguments override `.env` values.
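As a rough illustration of that precedence, the sketch below resolves the MUT settings with `python-dotenv` and `argparse`; the actual script may resolve its configuration differently:

```python
import argparse
import os

from dotenv import load_dotenv

# Load .env values into the process environment (already-set env vars win).
load_dotenv()

parser = argparse.ArgumentParser(description="Resolve model-under-test settings")
parser.add_argument("--endpoint")
parser.add_argument("--api-key")
parser.add_argument("--model")
args = parser.parse_args()

# Command-line arguments take precedence; fall back to .env / defaults.
endpoint = args.endpoint or os.getenv("MUT_ENDPOINT", "http://localhost:11434")
api_key = args.api_key or os.getenv("MUT_API_KEY", "")
model = args.model or os.getenv("MUT_MODEL", "")

print(f"endpoint={endpoint} model={model} auth={'yes' if api_key else 'no'}")
```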
#### Model Under Test (MUT)

| Variable | Description | Example |
| --- | --- | --- |
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |

#### Evaluator Configuration (for Non-Interactive Mode)

| Variable | Description | Example |
| --- | --- | --- |
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |

#### Test Configuration

| Variable | Description | Example |
| --- | --- | --- |
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |

### Command-Line Arguments

All environment variables have corresponding command-line flags:

```bash
python ai_eval.py --help

Options:
  --endpoint ENDPOINT            Model under test endpoint
  --api-key API_KEY              Model under test API key
  --model MODEL                  Model name to test
  --test-suite FILE              Test suite YAML file
  --output-dir DIR               Output directory
  --category CATEGORY            Filter by category
  --non-interactive              Enable automated evaluation
  --evaluator-endpoint ENDPOINT  Evaluator API endpoint
  --evaluator-api-key KEY        Evaluator API key
  --evaluator-model MODEL        Evaluator model name
  --evaluator-temperature TEMP   Evaluator temperature
```

## Advanced Usage

### Custom Test Suite

Edit `test_suite.yaml` to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn"  # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium"  # medium, hard, very_hard
```

### Batch Testing Examples

Testing multiple models using the `.env` configuration:

```bash
# Configure .env with multiple models
cp .env.example .env
nano .env

# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch tests
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M

# Generate comparison after testing
python analyze_results.py --compare
```

### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```
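Finally, if you extend `test_suite.yaml` as described under Advanced Usage above, a quick parse check before a long run can catch typos early. The sketch below assumes the top-level structure and per-test keys shown in the Custom Test Suite example; the script's own schema validation (if any) may differ:

```python
import sys

import yaml

# Keys used in the example test definition above; treat this as a rough
# sanity pass, not the authoritative schema enforced by ai_eval.py.
EXPECTED_TEST_KEYS = {"id", "name", "type", "prompt", "evaluation_criteria"}


def validate_suite(path: str = "test_suite.yaml") -> bool:
    with open(path, "r", encoding="utf-8") as fh:
        suite = yaml.safe_load(fh)
    ok = True
    for category in suite:
        for test in category.get("tests", []):
            missing = EXPECTED_TEST_KEYS - set(test)
            if missing:
                print(f"[WARN] {test.get('id', '<no id>')}: missing {sorted(missing)}")
                ok = False
    print("Suite parsed successfully." if ok else "Suite parsed with warnings.")
    return ok


if __name__ == "__main__":
    sys.exit(0 if validate_suite() else 1)
```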