initial commit

2026-01-16 09:18:07 +01:00
parent 1ef6758b3d
commit 514bd9b571
7 changed files with 2134 additions and 1 deletion

README.md (251 lines changed)

@@ -1,2 +1,251 @@
# AI Model Evaluation Framework
Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for comparing model variants (q4_K_M, q8_0, fp16) against academic and practical scenarios.
## Features
- **Comprehensive Test Coverage**
- Logic & Reasoning
- Mathematics & Calculations
- Instruction Following
- Creative Writing
- Code Generation
- Language Nuance
- IT Forensics (MFT analysis, file signatures, registry, memory, network)
- Multi-turn conversations with context retention
- **IT Forensics Focus**
- Raw hex dump analysis (Master File Table)
- File signature identification
- Registry hive analysis
- FILETIME conversions
- Memory artifact extraction
- TCP/IP header analysis
- Timeline reconstruction
- **Automated Testing**
- OpenAI-compatible API support (Ollama, LM Studio, etc.)
- Interactive evaluation with scoring rubric
- Progress tracking and auto-save
- Multi-turn conversation handling
- **Analysis & Comparison**
- Cross-model comparison reports
- Category-wise performance breakdown
- Difficulty-based analysis
- CSV export for further analysis
## Quick Start
### Prerequisites
```bash
# Python 3.8+
pip install pyyaml requests
```
### Installation
```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
```
### Basic Usage
#### 1. Test a Single Model
```bash
# For Ollama (default: http://localhost:11434)
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
# For other endpoints with API key
python ai_eval.py \
--endpoint https://api.example.com \
--api-key sk-your-key-here \
--model your-model-name
```
#### 2. Test Multiple Models (Quantization Comparison)
```bash
# Test different quantizations of qwen3:4b
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16
# Test different model sizes
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M
```
#### 3. Filter by Category
```bash
# Test only IT Forensics categories
python ai_eval.py \
--endpoint http://localhost:11434 \
--model qwen3:4b \
--category "IT Forensics - File Systems"
```
#### 4. Analyze Results
```bash
# Compare all tested models
python analyze_results.py --compare
# Detailed report for specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"
# Export to CSV
python analyze_results.py --export comparison.csv
```
## Scoring Rubric
All tests are evaluated on a 0-5 scale:
| Score | Category | Description |
|-------|----------|-------------|
| 0-1 | **FAIL** | Major errors, fails to meet basic requirements |
| 2-3 | **PASS** | Meets requirements with minor issues |
| 4-5 | **EXCEPTIONAL** | Exceeds requirements, demonstrates deep understanding |
### Evaluation Criteria
#### Constraint Adherence
- Fail: Violates more than one constraint or uses a forbidden word
- Pass: Follows all constraints but flow is awkward
- Exceptional: Follows all constraints with natural, fluid language
#### Unit Precision (for math/forensics)
- Fail: Errors in basic conversion
- Pass: Correct conversions, but with rounding errors
- Exceptional: Perfect precision across systems
#### Reasoning Path
- Fail: Gives only final answer without steps
- Pass: Shows steps but logic contains "leaps"
- Exceptional: Transparent, logical chain-of-thought
#### Code Safety
- Fail: Function crashes on bad input
- Pass: Logic correct but lacks error handling
- Exceptional: Production-ready with robust error catching
## Test Categories Overview
### General Reasoning (14 tests)
- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics
### IT Forensics (8 tests)
#### File Systems
- **MFT Basic Analysis**: Signature, status flags, sequence numbers
- **MFT Advanced**: Update sequence arrays, LSN, attribute offsets
- **File Signatures**: Magic number identification (JPEG, PNG, PDF, ZIP, RAR)
#### Registry & Artifacts
- **Registry Hive Headers**: Signature, sequence numbers, format version
- **FILETIME Conversion**: Windows timestamp decoding
#### Memory & Network
- **Memory Artifacts**: HTTP request extraction from dumps
- **TCP Headers**: Port, sequence, flags, window size analysis
#### Timeline Analysis
- **Event Reconstruction**: Log correlation, attack narrative building
### Multi-turn Conversations (3 tests)
- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)
## File Structure
```bash
.
├── ai_eval.py # Main testing script
├── analyze_results.py # Results analysis and comparison
├── test_suite.yaml # Test definitions
├── results/ # Auto-created results directory
│ ├── qwen3_4b-q4_K_M_latest.json
│ ├── qwen3_4b-q8_0_latest.json
│ └── qwen3_4b-fp16_latest.json
└── README.md
```
## Advanced Usage
### Custom Test Suite
Edit `test_suite.yaml` to add your own tests:
```yaml
- category: "Your Category"
tests:
- id: "custom_01"
name: "Your Test Name"
type: "single_turn" # or "multi_turn"
prompt: "Your test prompt here"
evaluation_criteria:
- "Criterion 1"
- "Criterion 2"
expected_difficulty: "medium" # medium, hard, very_hard
```
### Batch Testing Script
Create `batch_test.sh`:
```bash
#!/bin/bash
ENDPOINT="http://localhost:11434"
# Test all qwen3:4b quantizations
for quant in q4_K_M q8_0 fp16; do
echo "Testing qwen3:4b-${quant}..."
python ai_eval.py --endpoint $ENDPOINT --model "qwen3:4b-${quant}"
done
# Test all sizes with q4_K_M
for size in 4b 8b 14b; do
echo "Testing qwen3:${size}-q4_K_M..."
python ai_eval.py --endpoint $ENDPOINT --model "qwen3:${size}-q4_K_M"
done
# Generate comparison
python analyze_results.py --compare
```
### Custom Endpoint Configuration
For OpenAI-compatible cloud services:
```bash
python ai_eval.py \
--endpoint https://api.service.com \
--api-key your-api-key \
--model model-name
```

ai_eval.py (new file, 468 lines)

@@ -0,0 +1,468 @@
#!/usr/bin/env python3
"""
AI Model Evaluation Automation Script
Runs comprehensive test suite against OpenAI-compatible API endpoints
"""
import yaml
import json
import requests
import os
import sys
from datetime import datetime
from typing import Dict, List, Any, Optional
from pathlib import Path
import argparse
class AIModelTester:
def __init__(self, endpoint: str, api_key: str, model_name: str, output_dir: str = "results"):
"""
Initialize the AI Model Tester
Args:
endpoint: OpenAI-compatible API endpoint URL
api_key: API key for authentication
model_name: Name/identifier of the model being tested
output_dir: Directory to save results
"""
self.endpoint = endpoint.rstrip('/')
self.api_key = api_key
self.model_name = model_name
self.output_dir = Path(output_dir)
self.output_dir.mkdir(exist_ok=True)
# Results storage
self.results = {
"metadata": {
"model_name": model_name,
"endpoint": endpoint,
"test_start": datetime.now().isoformat(),
"test_end": None,
"total_tests": 0,
"completed_tests": 0
},
"test_results": []
}
# Current test session info
self.current_test_id = None
self.conversation_history = []
def load_test_suite(self, yaml_file: str) -> Dict:
"""Load test suite from YAML file"""
try:
with open(yaml_file, 'r', encoding='utf-8') as f:
return yaml.safe_load(f)
except FileNotFoundError:
print(f"Error: Test suite file not found: {yaml_file}")
print(f"Please ensure {yaml_file} is in the current directory.")
sys.exit(1)
except yaml.YAMLError as e:
print(f"Error: Invalid YAML format in {yaml_file}")
print(f"Details: {e}")
sys.exit(1)
except Exception as e:
print(f"Error loading test suite: {e}")
sys.exit(1)
def call_api(self, messages: List[Dict], temperature: float = 0.7, max_tokens: int = 2000) -> Optional[Dict]:
"""
Call the OpenAI-compatible API
Args:
messages: List of message dicts with 'role' and 'content'
temperature: Sampling temperature
max_tokens: Maximum tokens in response
Returns:
API response dict or None if error
"""
headers = {
"Content-Type": "application/json"
}
# Only add Authorization header if API key is provided
if self.api_key:
headers["Authorization"] = f"Bearer {self.api_key}"
payload = {
"model": self.model_name,
"messages": messages,
"temperature": temperature,
"max_tokens": max_tokens
}
try:
response = requests.post(
f"{self.endpoint}/v1/chat/completions",
headers=headers,
json=payload,
timeout=120
)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
print(f"\n❌ API Error: {e}")
if hasattr(e, 'response') and e.response is not None:
print(f"Response: {e.response.text}")
return None
def display_test_info(self, test: Dict, category: str):
"""Display test information to user"""
print("\n" + "="*80)
print(f"📋 CATEGORY: {category}")
print(f"🆔 Test ID: {test['id']}")
print(f"📝 Test Name: {test['name']}")
print(f"🎯 Type: {test['type']}")
print(f"⚡ Difficulty: {test.get('expected_difficulty', 'N/A')}")
print("="*80)
def display_prompt(self, prompt: str, turn: Optional[int] = None):
"""Display the prompt being sent"""
if turn is not None:
print(f"\n🔄 TURN {turn}:")
else:
print(f"\n💬 PROMPT:")
print("-"*80)
print(prompt)
print("-"*80)
def display_response(self, response_text: str):
"""Display the model's response"""
print(f"\n🤖 MODEL RESPONSE:")
print("-"*80)
print(response_text)
print("-"*80)
def display_evaluation_criteria(self, criteria: List[str]):
"""Display evaluation criteria for the test"""
print(f"\n✅ EVALUATION CRITERIA:")
for i, criterion in enumerate(criteria, 1):
print(f" {i}. {criterion}")
def get_user_score(self) -> Dict:
"""Prompt user for evaluation score"""
print("\n" + "="*80)
print("📊 EVALUATION SCORING RUBRIC:")
print(" 0-1: FAIL - Major errors, fails to meet basic requirements")
print(" 2-3: PASS - Meets requirements with minor issues")
print(" 4-5: EXCEPTIONAL - Exceeds requirements, demonstrates deep understanding")
print("="*80)
while True:
try:
score_input = input("\n👉 Enter score (0-5) or 'skip' to skip this test: ").strip().lower()
if score_input == 'skip':
return {"score": None, "notes": "Skipped by user"}
score = int(score_input)
if 0 <= score <= 5:
notes = input("📝 Notes (optional, press Enter to skip): ").strip()
return {"score": score, "notes": notes if notes else ""}
else:
print("❌ Score must be between 0 and 5")
except ValueError:
print("❌ Invalid input. Please enter a number between 0 and 5, or 'skip'")
except KeyboardInterrupt:
print("\n\n⚠️ Test interrupted by user")
return {"score": None, "notes": "Interrupted"}
def run_single_turn_test(self, test: Dict, category: str) -> Dict:
"""Run a single-turn test"""
self.display_test_info(test, category)
self.display_prompt(test['prompt'])
# Prepare messages
messages = [{"role": "user", "content": test['prompt']}]
# Call API
response = self.call_api(messages)
if response is None:
return {
"test_id": test['id'],
"test_name": test['name'],
"category": category,
"type": "single_turn",
"status": "api_error",
"score": None,
"notes": "API call failed"
}
# Extract response text
response_text = response['choices'][0]['message']['content']
self.display_response(response_text)
# Display evaluation criteria
self.display_evaluation_criteria(test.get('evaluation_criteria', []))
# Get user evaluation
evaluation = self.get_user_score()
return {
"test_id": test['id'],
"test_name": test['name'],
"category": category,
"type": "single_turn",
"difficulty": test.get('expected_difficulty', 'unknown'),
"prompt": test['prompt'],
"response": response_text,
"evaluation_criteria": test.get('evaluation_criteria', []),
"score": evaluation['score'],
"notes": evaluation['notes'],
"status": "completed" if evaluation['score'] is not None else "skipped",
"timestamp": datetime.now().isoformat()
}
def run_multi_turn_test(self, test: Dict, category: str) -> Dict:
"""Run a multi-turn test"""
self.display_test_info(test, category)
# Initialize conversation history
self.conversation_history = []
turn_results = []
for i, turn_data in enumerate(test['turns'], 1):
turn_num = turn_data['turn']
prompt = turn_data['prompt']
self.display_prompt(prompt, turn_num)
# Add to conversation history
self.conversation_history.append({"role": "user", "content": prompt})
# Call API with full conversation history
response = self.call_api(self.conversation_history)
if response is None:
turn_results.append({
"turn": turn_num,
"status": "api_error",
"prompt": prompt,
"response": None
})
break
# Extract and display response
response_text = response['choices'][0]['message']['content']
self.display_response(response_text)
# Add assistant response to history
self.conversation_history.append({"role": "assistant", "content": response_text})
# Display criteria for this turn
self.display_evaluation_criteria(turn_data.get('evaluation_criteria', []))
# Get evaluation for this turn
print(f"\n🎯 Evaluate Turn {turn_num}:")
evaluation = self.get_user_score()
turn_results.append({
"turn": turn_num,
"prompt": prompt,
"response": response_text,
"evaluation_criteria": turn_data.get('evaluation_criteria', []),
"score": evaluation['score'],
"notes": evaluation['notes'],
"status": "completed" if evaluation['score'] is not None else "skipped"
})
if evaluation['score'] is None:
print(f"\n⚠️ Turn {turn_num} skipped, stopping multi-turn test")
break
# Calculate overall score for multi-turn test
valid_scores = [t['score'] for t in turn_results if t['score'] is not None]
overall_score = sum(valid_scores) / len(valid_scores) if valid_scores else None
return {
"test_id": test['id'],
"test_name": test['name'],
"category": category,
"type": "multi_turn",
"difficulty": test.get('expected_difficulty', 'unknown'),
"turns": turn_results,
"overall_score": overall_score,
"status": "completed" if overall_score is not None else "incomplete",
"timestamp": datetime.now().isoformat()
}
def run_test_suite(self, test_suite: Dict, filter_category: Optional[str] = None):
"""Run the complete test suite"""
print("\n" + "="*80)
print(f"🚀 STARTING TEST SUITE")
print(f"📦 Model: {self.model_name}")
print(f"🔗 Endpoint: {self.endpoint}")
print("="*80)
# Count total tests
total_tests = 0
for cat_data in test_suite.get('test_categories', []):
if filter_category and cat_data['category'] != filter_category:
continue
total_tests += len(cat_data.get('tests', []))
self.results['metadata']['total_tests'] = total_tests
# Run tests by category
test_count = 0
for cat_data in test_suite.get('test_categories', []):
category = cat_data['category']
# Apply category filter if specified
if filter_category and category != filter_category:
continue
print(f"\n\n{'='*80}")
print(f"📂 CATEGORY: {category}")
print(f"{'='*80}")
for test in cat_data.get('tests', []):
test_count += 1
print(f"\n📊 Progress: {test_count}/{total_tests}")
# Run appropriate test type
if test.get('type') == 'single_turn':
result = self.run_single_turn_test(test, category)
elif test.get('type') == 'multi_turn':
result = self.run_multi_turn_test(test, category)
else:
print(f"⚠️ Unknown test type: {test.get('type')}")
continue
self.results['test_results'].append(result)
self.results['metadata']['completed_tests'] += 1
# Save after each test (in case of interruption)
self.save_results()
# Mark test suite as complete
self.results['metadata']['test_end'] = datetime.now().isoformat()
self.save_results()
print("\n\n" + "="*80)
print("✅ TEST SUITE COMPLETE")
print("="*80)
self.display_summary()
def save_results(self):
"""Save results to JSON file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"{self.model_name.replace(':', '_')}_{timestamp}.json"
filepath = self.output_dir / filename
with open(filepath, 'w', encoding='utf-8') as f:
json.dump(self.results, f, indent=2, ensure_ascii=False)
# Also save as "latest" for this model
latest_file = self.output_dir / f"{self.model_name.replace(':', '_')}_latest.json"
with open(latest_file, 'w', encoding='utf-8') as f:
json.dump(self.results, f, indent=2, ensure_ascii=False)
def display_summary(self):
"""Display test summary"""
total = self.results['metadata']['total_tests']
completed = self.results['metadata']['completed_tests']
# Calculate statistics
scores = [r.get('score') or r.get('overall_score')
for r in self.results['test_results']]
scores = [s for s in scores if s is not None]
if scores:
avg_score = sum(scores) / len(scores)
print(f"\n📊 SUMMARY:")
print(f" Total Tests: {total}")
print(f" Completed: {completed}")
print(f" Average Score: {avg_score:.2f}/5.00")
print(f" Pass Rate: {len([s for s in scores if s >= 2]) / len(scores) * 100:.1f}%")
print(f" Exceptional Rate: {len([s for s in scores if s >= 4]) / len(scores) * 100:.1f}%")
print(f"\n💾 Results saved to: {self.output_dir}")
def main():
parser = argparse.ArgumentParser(
description="AI Model Evaluation Test Suite",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Test a single model
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
# Test with API key
python ai_eval.py --endpoint https://api.example.com --api-key sk-xxx --model qwen3:8b
# Test only forensics category
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b --category "IT Forensics - File Systems"
# Test multiple models (run separately)
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16
"""
)
parser.add_argument(
'--endpoint',
required=True,
help='OpenAI-compatible API endpoint (e.g., http://localhost:11434 for Ollama)'
)
parser.add_argument(
'--api-key',
default='',
help='API key for authentication (optional for local endpoints)'
)
parser.add_argument(
'--model',
required=True,
help='Model name/identifier (e.g., qwen3:4b-q4_K_M)'
)
parser.add_argument(
'--test-suite',
default='test_suite.yaml',
help='Path to test suite YAML file (default: test_suite.yaml)'
)
parser.add_argument(
'--output-dir',
default='results',
help='Directory to save results (default: results)'
)
parser.add_argument(
'--category',
default=None,
help='Filter tests by category (optional)'
)
args = parser.parse_args()
# Initialize tester
tester = AIModelTester(
endpoint=args.endpoint,
api_key=args.api_key,
model_name=args.model,
output_dir=args.output_dir
)
# Load test suite
print(f"📁 Loading test suite from: {args.test_suite}")
test_suite = tester.load_test_suite(args.test_suite)
# Run tests
try:
tester.run_test_suite(test_suite, filter_category=args.category)
except KeyboardInterrupt:
print("\n\n⚠️ Test suite interrupted by user")
tester.results['metadata']['test_end'] = datetime.now().isoformat()
tester.save_results()
print(f"\n💾 Partial results saved to: {tester.output_dir}")
sys.exit(1)
if __name__ == "__main__":
main()

analyze_results.py (new file, 355 lines)

@@ -0,0 +1,355 @@
#!/usr/bin/env python3
"""
AI Model Evaluation Results Analyzer
Compares results across different models and quantizations
"""
import json
import sys
from pathlib import Path
from typing import List, Dict
import argparse
from collections import defaultdict
class ResultsAnalyzer:
def __init__(self, results_dir: str = "results"):
self.results_dir = Path(results_dir)
def load_result_file(self, filepath: Path) -> Dict:
"""Load a single result file"""
with open(filepath, 'r', encoding='utf-8') as f:
return json.load(f)
def find_result_files(self, pattern: str = "*_latest.json") -> List[Path]:
"""Find all result files matching pattern"""
return sorted(self.results_dir.glob(pattern))
def extract_scores_by_category(self, results: Dict) -> Dict[str, List[float]]:
"""Extract scores organized by category"""
scores_by_category = defaultdict(list)
for test in results.get('test_results', []):
category = test.get('category', 'Unknown')
score = test.get('score') or test.get('overall_score')
if score is not None:
scores_by_category[category].append(score)
return dict(scores_by_category)
def calculate_statistics(self, scores: List[float]) -> Dict:
"""Calculate statistics for a list of scores"""
if not scores:
return {
'count': 0,
'average': 0.0,
'min': 0.0,
'max': 0.0,
'pass_rate': 0.0,
'exceptional_rate': 0.0
}
return {
'count': len(scores),
'average': sum(scores) / len(scores),
'min': min(scores),
'max': max(scores),
'pass_rate': len([s for s in scores if s >= 2]) / len(scores) * 100,
'exceptional_rate': len([s for s in scores if s >= 4]) / len(scores) * 100
}
def compare_models(self, result_files: List[Path]):
"""Generate comparison report for multiple models"""
print("\n" + "="*100)
print("📊 AI MODEL COMPARISON REPORT")
print("="*100)
# Load all results
all_results = {}
for filepath in result_files:
try:
results = self.load_result_file(filepath)
model_name = results['metadata']['model_name']
all_results[model_name] = results
except Exception as e:
print(f"⚠️ Error loading {filepath}: {e}")
if not all_results:
print("❌ No valid result files found")
return
# Overall comparison
print("\n📈 OVERALL PERFORMANCE")
print("-"*100)
print(f"{'Model':<30} {'Total Tests':<12} {'Avg Score':<12} {'Pass Rate':<12} {'Exceptional':<12}")
print("-"*100)
model_stats = {}
for model_name, results in sorted(all_results.items()):
all_scores = [
test.get('score') or test.get('overall_score')
for test in results.get('test_results', [])
]
all_scores = [s for s in all_scores if s is not None]
stats = self.calculate_statistics(all_scores)
model_stats[model_name] = stats
print(f"{model_name:<30} {stats['count']:<12} {stats['average']:<12.2f} "
f"{stats['pass_rate']:<12.1f}% {stats['exceptional_rate']:<12.1f}%")
# Category-wise comparison
print("\n\n📂 CATEGORY-WISE PERFORMANCE")
print("="*100)
# Get all unique categories
all_categories = set()
for results in all_results.values():
for test in results.get('test_results', []):
all_categories.add(test.get('category', 'Unknown'))
for category in sorted(all_categories):
print(f"\n🔖 {category}")
print("-"*100)
print(f"{'Model':<30} {'Tests':<8} {'Avg Score':<12} {'Pass Rate':<12} {'Exceptional':<12}")
print("-"*100)
for model_name, results in sorted(all_results.items()):
cat_scores = [
test.get('score') or test.get('overall_score')
for test in results.get('test_results', [])
if test.get('category') == category and (test.get('score') or test.get('overall_score')) is not None
]
if cat_scores:
stats = self.calculate_statistics(cat_scores)
print(f"{model_name:<30} {stats['count']:<8} {stats['average']:<12.2f} "
f"{stats['pass_rate']:<12.1f}% {stats['exceptional_rate']:<12.1f}%")
else:
print(f"{model_name:<30} {'N/A':<8} {'N/A':<12} {'N/A':<12} {'N/A':<12}")
# Difficulty-based comparison
print("\n\n⚡ DIFFICULTY-BASED PERFORMANCE")
print("="*100)
difficulties = ['medium', 'hard', 'very_hard']
for difficulty in difficulties:
print(f"\n🎯 Difficulty: {difficulty.replace('_', ' ').title()}")
print("-"*100)
print(f"{'Model':<30} {'Tests':<8} {'Avg Score':<12} {'Pass Rate':<12}")
print("-"*100)
for model_name, results in sorted(all_results.items()):
diff_scores = [
test.get('score') or test.get('overall_score')
for test in results.get('test_results', [])
if test.get('difficulty') == difficulty and (test.get('score') or test.get('overall_score')) is not None
]
if diff_scores:
stats = self.calculate_statistics(diff_scores)
print(f"{model_name:<30} {stats['count']:<8} {stats['average']:<12.2f} "
f"{stats['pass_rate']:<12.1f}%")
else:
print(f"{model_name:<30} {'N/A':<8} {'N/A':<12} {'N/A':<12}")
# Winner analysis
print("\n\n🏆 WINNERS BY CATEGORY")
print("="*100)
for category in sorted(all_categories):
best_model = None
best_score = -1
for model_name, results in all_results.items():
cat_scores = [
test.get('score') or test.get('overall_score')
for test in results.get('test_results', [])
if test.get('category') == category and (test.get('score') or test.get('overall_score')) is not None
]
if cat_scores:
avg = sum(cat_scores) / len(cat_scores)
if avg > best_score:
best_score = avg
best_model = model_name
if best_model:
print(f"{category:<50}{best_model} ({best_score:.2f})")
print("\n\n🎖️ OVERALL WINNER")
print("="*100)
best_overall = max(model_stats.items(), key=lambda x: x[1]['average'])
print(f"Model: {best_overall[0]}")
print(f"Average Score: {best_overall[1]['average']:.2f}/5.00")
print(f"Pass Rate: {best_overall[1]['pass_rate']:.1f}%")
print(f"Exceptional Rate: {best_overall[1]['exceptional_rate']:.1f}%")
print("="*100)
def generate_detailed_report(self, model_name: str):
"""Generate detailed report for a specific model"""
# Find result file for this model
pattern = f"{model_name.replace(':', '_')}_latest.json"
filepath = self.results_dir / pattern
if not filepath.exists():
print(f"❌ No results found for model: {model_name}")
return
results = self.load_result_file(filepath)
print("\n" + "="*100)
print(f"📋 DETAILED REPORT: {model_name}")
print("="*100)
# Metadata
metadata = results.get('metadata', {})
print(f"\n⏱️ Test Duration: {metadata.get('test_start')}{metadata.get('test_end')}")
print(f"📊 Tests: {metadata.get('completed_tests')}/{metadata.get('total_tests')}")
# Overall stats
all_scores = [
test.get('score') or test.get('overall_score')
for test in results.get('test_results', [])
]
all_scores = [s for s in all_scores if s is not None]
stats = self.calculate_statistics(all_scores)
print(f"\n📈 Overall Performance:")
print(f" Average Score: {stats['average']:.2f}/5.00")
print(f" Pass Rate: {stats['pass_rate']:.1f}%")
print(f" Exceptional Rate: {stats['exceptional_rate']:.1f}%")
print(f" Score Range: {stats['min']:.1f} - {stats['max']:.1f}")
# Test-by-test results
print(f"\n\n📝 TEST-BY-TEST RESULTS")
print("="*100)
for test in results.get('test_results', []):
score = test.get('score') or test.get('overall_score')
status_icon = "" if score and score >= 4 else "⚠️" if score and score >= 2 else ""
print(f"\n{status_icon} [{test.get('test_id')}] {test.get('test_name')}")
print(f" Category: {test.get('category')}")
print(f" Type: {test.get('type')}")
print(f" Difficulty: {test.get('difficulty', 'unknown')}")
print(f" Score: {score if score is not None else 'N/A'}/5.00")
if test.get('notes'):
print(f" Notes: {test['notes']}")
# Show criteria pass/fail if available
if test.get('evaluation_criteria'):
print(f" Criteria ({len(test['evaluation_criteria'])} items):")
for criterion in test['evaluation_criteria']:
print(f"{criterion}")
print("\n" + "="*100)
def export_csv(self, output_file: str = "comparison.csv"):
"""Export comparison data to CSV"""
import csv
result_files = self.find_result_files()
if not result_files:
print("❌ No result files found")
return
# Prepare CSV data
csv_data = []
headers = ['Model', 'Test ID', 'Test Name', 'Category', 'Type', 'Difficulty', 'Score', 'Notes']
for filepath in result_files:
results = self.load_result_file(filepath)
model_name = results['metadata']['model_name']
for test in results.get('test_results', []):
csv_data.append([
model_name,
test.get('test_id', ''),
test.get('test_name', ''),
test.get('category', ''),
test.get('type', ''),
test.get('difficulty', ''),
test.get('score') or test.get('overall_score', ''),
test.get('notes', '')
])
# Write CSV
output_path = self.results_dir / output_file
with open(output_path, 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f)
writer.writerow(headers)
writer.writerows(csv_data)
print(f"✅ CSV exported to: {output_path}")
def main():
parser = argparse.ArgumentParser(
description="Analyze and compare AI model evaluation results",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Compare all models
python analyze_results.py --compare
# Detailed report for specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"
# Export to CSV
python analyze_results.py --export comparison.csv
# Custom results directory
python analyze_results.py --results-dir ./my_results --compare
"""
)
parser.add_argument(
'--results-dir',
default='results',
help='Directory containing result JSON files (default: results)'
)
parser.add_argument(
'--compare',
action='store_true',
help='Generate comparison report for all models'
)
parser.add_argument(
'--detail',
type=str,
help='Generate detailed report for specific model'
)
parser.add_argument(
'--export',
type=str,
help='Export results to CSV file'
)
args = parser.parse_args()
analyzer = ResultsAnalyzer(results_dir=args.results_dir)
if args.compare:
result_files = analyzer.find_result_files()
if result_files:
analyzer.compare_models(result_files)
else:
print(f"❌ No result files found in {args.results_dir}")
if args.detail:
analyzer.generate_detailed_report(args.detail)
if args.export:
analyzer.export_csv(args.export)
if not (args.compare or args.detail or args.export):
parser.print_help()
if __name__ == "__main__":
main()

batch_test.sh (new executable file, 85 lines)

@@ -0,0 +1,85 @@
#!/bin/bash
# Batch Test Script for AI Model Evaluation
# Tests multiple models and generates comparison report
# Configuration
ENDPOINT="${ENDPOINT:-http://localhost:11434}"
API_KEY="${API_KEY:-}"
# Color output
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}AI Model Batch Testing${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
echo "Endpoint: $ENDPOINT"
echo "API Key: ${API_KEY:0:10}${API_KEY:+...}"
echo ""
# Function to run test
run_test() {
local model=$1
echo -e "${GREEN}Testing: $model${NC}"
if [ -z "$API_KEY" ]; then
python ai_eval.py --endpoint "$ENDPOINT" --model "$model"
else
python ai_eval.py --endpoint "$ENDPOINT" --api-key "$API_KEY" --model "$model"
fi
if [ $? -eq 0 ]; then
echo -e "${GREEN}✓ Completed: $model${NC}"
else
echo -e "${YELLOW}⚠ Failed or interrupted: $model${NC}"
fi
echo ""
}
# Test qwen3:4b models with different quantizations
echo -e "${BLUE}=== Testing qwen3:4b with different quantizations ===${NC}"
echo ""
models_4b=(
"qwen3:4b-q4_K_M"
"qwen3:4b-q8_0"
"qwen3:4b-fp16"
)
for model in "${models_4b[@]}"; do
run_test "$model"
done
# Test different model sizes with q4_K_M quantization
echo -e "${BLUE}=== Testing different model sizes (q4_K_M) ===${NC}"
echo ""
models_sizes=(
"qwen3:4b-q4_K_M"
"qwen3:8b-q4_K_M"
"qwen3:14b-q4_K_M"
)
for model in "${models_sizes[@]}"; do
run_test "$model"
done
# Generate comparison report
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}Generating Comparison Report${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
python analyze_results.py --compare
python analyze_results.py --export batch_comparison.csv
echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN}Batch Testing Complete!${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo "Results saved in ./results/"
echo "Comparison CSV: ./results/batch_comparison.csv"

requirements.txt (new file, 2 lines)

@@ -0,0 +1,2 @@
pyyaml
requests

test_suite.md (new file, 452 lines)

@@ -0,0 +1,452 @@
# IT Forensics Tests
This document provides detailed explanations of the IT Forensics tests in the evaluation suite.
## Overview
The forensics tests are designed to evaluate an AI model's ability to:
1. Interpret raw hex data from various forensic artifacts
2. Apply domain knowledge of file systems, registry, and network protocols
3. Perform accurate byte-order conversions (little-endian)
4. Correlate events and reconstruct timelines
5. Explain technical concepts clearly
## 🔍 Test Breakdown
### IT Forensics - File Systems
#### Test: forensics_mft_01 - MFT Entry Analysis (Basic)
**Purpose**: Evaluate basic NTFS Master File Table interpretation
**Key Concepts**:
- MFT Signature: "FILE" (46 49 4C 45 in hex, ASCII)
- Entry flags at offset 0x16:
- 0x01 = In use
- 0x02 = Directory
- Sequence number: 16-bit value at offset 0x10 (little-endian)
**Example Hex Dump**:
```bash
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 46 49 4C 45 30 00 03 00 95 1F 23 00 00 00 00 00
00000010 01 00 01 00 38 00 01 00 A0 01 00 00 00 04 00 00
```
**Expected Analysis**:
- Signature: "FILE" (bytes 00-03)
- Update Sequence Offset: 0x0030 (bytes 04-05, little-endian)
- Update Sequence Size: 0x0003 (bytes 06-07, little-endian)
- Sequence Number: 0x0001 (bytes 10-11, little-endian)
- Flags: 0x0001 at offset 0x16 = In use
**Scoring Criteria**:
- 5 points: Identifies all fields correctly with offset references
- 3-4 points: Identifies most fields, minor errors in interpretation
- 1-2 points: Recognizes MFT but misses key fields
- 0 points: Cannot identify as MFT entry
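These fields can be cross-checked with a few lines of Python. The sketch below is illustrative only (it is not part of the shipped scripts) and simply unpacks the example bytes at the standard FILE-record header offsets:
```python
import struct

# The first 48 bytes of the example MFT entry shown above
mft = bytes.fromhex(
    "46494c4530000300951f230000000000"
    "0100010038000100a001000000040000"
    "00000000000000000600000000000000"
)

signature = mft[0:4]                                  # b'FILE'
(seq_number,) = struct.unpack_from("<H", mft, 0x10)   # little-endian 16-bit
(flags,) = struct.unpack_from("<H", mft, 0x16)

print("Signature      :", signature.decode("ascii"))  # FILE
print("Sequence number:", seq_number)                 # 1
print("In use         :", bool(flags & 0x01))         # True
print("Is directory   :", bool(flags & 0x02))         # False
```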
---
#### Test: forensics_mft_02 - MFT Entry Analysis (Advanced)
**Purpose**: Deep understanding of MFT structure
**Additional Concepts**:
- Update Sequence Array (USA): Anti-corruption mechanism
- $LogFile Sequence Number (LSN): Transaction logging
- First Attribute Offset: Where attribute records begin
- MFT Entry Flags: Bitfield indicating file properties
**Key Offsets**:
- 0x00-0x03: Signature "FILE"
- 0x04-0x05: Update Sequence Offset
- 0x06-0x07: Update Sequence Size
- 0x08-0x0F: $LogFile Sequence Number (LSN, 64-bit)
- 0x10-0x11: Sequence Number
- 0x14-0x15: First Attribute Offset
- 0x16-0x17: Flags (0x01=in use, 0x02=directory)
**Example Analysis for LSN**:
```bash
Offset 08: EA 3F 00 00 00 00 00 00
Little-endian 64-bit: 0x0000000000003FEA = 16362 decimal
```
**Scoring Criteria**:
- 5 points: All fields correct with little-endian conversion shown
- 3-4 points: Most fields correct, minor calculation errors
- 1-2 points: Understands structure but significant errors
- 0 points: Cannot parse MFT header
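The same approach extends to the additional header fields asked for here. A minimal sketch over the advanced example bytes (illustrative only; offsets follow the list above):
```python
import struct

# The 64 bytes of the advanced example MFT entry shown above
entry = bytes.fromhex(
    "46494c4530000300ea3f000000000000"
    "01000100380001006801000000040000"
    "00000000000000000400000005000000"
    "2a000000000000001000000060000000"
)

usa_offset, usa_size = struct.unpack_from("<HH", entry, 0x04)
(lsn,) = struct.unpack_from("<Q", entry, 0x08)          # 64-bit $LogFile sequence number
(first_attr_offset,) = struct.unpack_from("<H", entry, 0x14)
(flags,) = struct.unpack_from("<H", entry, 0x16)

print(f"USA offset      : 0x{usa_offset:04X}")          # 0x0030
print(f"USA size        : {usa_size}")                  # 3
print(f"$LogFile LSN    : {lsn}")                        # 16362
print(f"First attribute : 0x{first_attr_offset:04X}")    # 0x0038
print(f"Flags           : 0x{flags:04X} (in use, not a directory)")
```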
---
#### Test: forensics_signature_01 - File Signature Identification
**Purpose**: Recognition of common file magic numbers
**Magic Numbers to Know**:
| Signature | File Type | Notes |
|-----------|-----------|-------|
| FF D8 FF E0 | JPEG | Often followed by "JFIF" |
| 89 50 4E 47 0D 0A 1A 0A | PNG | \\x89PNG + line endings |
| 25 50 44 46 | PDF | "%PDF" in ASCII |
| 50 4B 03 04 | ZIP | "PK" headers (PKZip) |
| 52 61 72 21 1A 07 | RAR | "Rar!" + markers |
| 4D 5A | EXE/DLL | DOS "MZ" header |
| 7F 45 4C 46 | ELF | Linux executables |
**Test Example**:
```bash
A) FF D8 FF E0 00 10 4A 46 49 46
→ JPEG (FF D8 FF + JFIF marker)
B) 50 4B 03 04 14 00 06 00
→ ZIP/DOCX/XLSX (PKZip format)
```
**Scoring Criteria**:
- 5 points: All signatures identified with explanations
- 3-4 points: Most correct, understands concept
- 1-2 points: Recognizes some but misses key ones
- 0 points: Cannot identify file signatures
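A hypothetical lookup helper shows how such prefixes are matched programmatically; the table it carries is deliberately small and only covers the signatures listed above:
```python
MAGIC_NUMBERS = {
    bytes.fromhex("FFD8FF"): "JPEG",
    bytes.fromhex("89504E470D0A1A0A"): "PNG",
    bytes.fromhex("25504446"): "PDF",
    bytes.fromhex("504B0304"): "ZIP (also DOCX/XLSX/JAR/APK)",
    bytes.fromhex("526172211A07"): "RAR",
    bytes.fromhex("4D5A"): "EXE/DLL (DOS MZ header)",
    bytes.fromhex("7F454C46"): "ELF",
}

def identify(hex_string: str) -> str:
    data = bytes.fromhex(hex_string)
    # Check longer prefixes first so a specific match wins over a short one
    for magic in sorted(MAGIC_NUMBERS, key=len, reverse=True):
        if data.startswith(magic):
            return MAGIC_NUMBERS[magic]
    return "unknown"

for sample in ("FF D8 FF E0 00 10 4A 46 49 46",
               "50 4B 03 04 14 00 06 00",
               "89 50 4E 47 0D 0A 1A 0A",
               "25 50 44 46 2D 31 2E 34",
               "52 61 72 21 1A 07 00"):
    print(sample, "->", identify(sample))
```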
---
### IT Forensics - Registry & Artifacts
#### Test: forensics_registry_01 - Windows Registry Hive Header
**Purpose**: Parse Windows Registry binary format
**Key Structure**:
```bash
Offset Field Size
0x00 Signature "regf" 4 bytes
0x04 Primary Seq Number 4 bytes (little-endian)
0x08 Secondary Seq Number 4 bytes (little-endian)
0x0C Timestamp 8 bytes (FILETIME)
0x14 Major Version 4 bytes
0x18 Minor Version 4 bytes
```
**Example**:
```bash
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 72 65 67 66 E6 07 00 00 E6 07 00 00 00 00 00 00
Analysis:
- Signature: "regf" (72 65 67 66)
- Primary Seq: 0x000007E6 = 2022 decimal
- Secondary Seq: 0x000007E6 = 2022 decimal
```
**Scoring Criteria**:
- 5 points: Correct parsing with endianness consideration
- 3-4 points: Identifies structure, minor errors
- 1-2 points: Recognizes registry but inaccurate parsing
- 0 points: Cannot identify registry hive
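A minimal Python sketch (illustrative only, not part of the shipped scripts) that unpacks the header fields at the offsets from the structure table above:
```python
import struct

# The 32 bytes of the example 'regf' header shown above (synthetic sample data)
hive = bytes.fromhex(
    "72656766e6070000e607000000000000"
    "01000000030000000000000001000000"
)

signature = hive[0:4].decode("ascii")                          # 'regf'
primary_seq, secondary_seq = struct.unpack_from("<II", hive, 0x04)
major, minor = struct.unpack_from("<II", hive, 0x14)           # as laid out in the table above

print("Signature            :", signature)
print("Primary sequence     :", primary_seq)                   # 2022
print("Secondary sequence   :", secondary_seq)                 # 2022
print(f"Format version       : {major}.{minor}")
```
Matching primary and secondary sequence numbers indicate the hive was written out cleanly; a mismatch suggests an interrupted write and a transaction log to check.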
---
#### Test: forensics_timestamp_01 - FILETIME Conversion
**Purpose**: Convert Windows timestamps to human-readable format
**FILETIME Format**:
- 64-bit value (little-endian)
- Counts 100-nanosecond intervals
- Epoch: January 1, 1601 00:00:00 UTC
**Conversion Process**:
1. Reverse byte order (little-endian to big-endian)
2. Convert to decimal
3. Divide by 10,000,000 to get seconds
4. Subtract the 1601→1970 offset (11,644,473,600 seconds) to obtain a Unix timestamp, or add the seconds directly to the 1601-01-01 epoch
**Example**:
```bash
Hex: 01 D8 93 4B 7C F3 D9 01
Reversed: 01 D9 F3 7C 4B 93 D8 01
Decimal: ≈ 133,405,379,000,000,000 (100-ns intervals)
Seconds since 1601: ≈ 13,340,538,000 → Unix time ≈ 1,696,064,000
Date: approximately 2023-09-30 UTC (verify with a script rather than by hand)
```
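To avoid the manual arithmetic entirely, a small Python helper (illustrative, not part of the shipped scripts) performs the same conversion; the comments show what the two example values decode to under the 1601 epoch:
```python
from datetime import datetime, timedelta, timezone
import struct

def filetime_to_utc(raw: bytes) -> datetime:
    """Convert an 8-byte little-endian FILETIME to a UTC datetime."""
    (ticks,) = struct.unpack("<Q", raw)          # 100-ns intervals since 1601-01-01 UTC
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=ticks // 10)

print(filetime_to_utc(bytes.fromhex("01D8934B7CF3D901")))  # ≈ 2023-09-30 08:58 UTC
print(filetime_to_utc(bytes.fromhex("00803ED5DEB19D01")))  # exactly 1970-01-01 00:00:00 UTC
```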
**Scoring Criteria**:
- 5 points: Correct conversion with methodology explained
- 3-4 points: Understands process, calculation errors acceptable
- 1-2 points: Recognizes FILETIME but significant errors
- 0 points: Cannot explain conversion
### IT Forensics - Memory & Network
#### Test: forensics_memory_01 - Memory Artifact Identification
**Purpose**: Extract meaningful data from memory dumps
**Key Artifacts to Identify**:
- HTTP headers (GET/POST requests)
- Session cookies (PHPSESSID, etc.)
- IP addresses and hostnames
- User agents
- Authentication tokens
**Example Analysis**:
```bash
GET /admin/login.php HTTP/1.1
Host: 192.168.1.100
Cookie: PHPSESSID=a3f7d8bc9e2a1d5c
Forensic Value:
- Web access to admin panel
- Target: 192.168.1.100
- Session: a3f7d8bc9e2a1d5c
- Timeline: Can correlate with web server logs
```
**Scoring Criteria**:
- 5 points: All artifacts extracted with forensic significance explained
- 3-4 points: Most artifacts identified, basic analysis
- 1-2 points: Recognizes HTTP but misses key details
- 0 points: Cannot identify artifacts
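A toy sketch of the extraction step; in practice the dump would be carved with dedicated tools (strings, bulk_extractor, Volatility), so this only shows the pattern-matching idea over an in-memory buffer:
```python
import re

# Toy buffer containing the HTTP request from the example above, padded with unrelated bytes
dump = (b"\x00\x17 garbage "
        b"GET /admin/login.php HTTP/1.1\r\nHost: 192.168.1.100\r\n"
        b"User-Agent: Mozilla/5.0\r\nCookie: PHPSESSID=a3f7d8bc9e2a1d5c\r\n\r\n"
        b"\xff\xfe more garbage")

for m in re.finditer(rb"(GET|POST) (\S+) HTTP/1\.[01]", dump):
    print("HTTP request:", m.group(0).decode(), "at offset", hex(m.start()))

for m in re.finditer(rb"Host: ([\d.]+)", dump):
    print("Target host :", m.group(1).decode())

for m in re.finditer(rb"PHPSESSID=([0-9a-f]+)", dump):
    print("Session ID  :", m.group(1).decode())
```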
---
#### Test: forensics_network_01 - TCP Header Analysis
**Purpose**: Parse TCP packet headers
**TCP Header Structure** (first 20 bytes):
```bash
Offset Field Size Notes
0-1 Source Port 16 bits Big-endian
2-3 Destination Port 16 bits Big-endian
4-7 Sequence Number 32 bits Big-endian
8-11 Acknowledgment 32 bits Big-endian
12 Data Offset+Flags 8 bits Upper 4=offset, lower 4=reserved
13 Flags 8 bits SYN, ACK, FIN, RST, PSH, URG
14-15 Window Size 16 bits Big-endian
16-17 Checksum 16 bits
18-19 Urgent Pointer 16 bits
```
**TCP Flags** (byte 13):
- 0x01: FIN
- 0x02: SYN
- 0x04: RST
- 0x08: PSH
- 0x10: ACK
- 0x20: URG
**Example**:
```bash
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 C3 5E 01 BB 6B 8B 9C 41 00 00 00 00 50 02 20 00
Analysis:
- Source Port: 0xC35E = 50014
- Dest Port: 0x01BB = 443 (HTTPS)
- Sequence: 0x6B8B9C41
- Flags: 0x02 = SYN (connection initiation)
- Window: 0x2000 = 8192 bytes
```
**Scoring Criteria**:
- 5 points: All fields correct with protocol understanding
- 3-4 points: Most fields correct, minor errors
- 1-2 points: Basic structure recognized, significant errors
- 0 points: Cannot parse TCP header
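A minimal Python sketch (illustrative only) that unpacks the example header and decodes the flag bits:
```python
import struct

# The 20-byte example TCP header from the hex dump above
tcp = bytes.fromhex("C35E01BB6B8B9C410000000050022000E6A10000")

(src_port, dst_port, seq, ack,
 offset_reserved, flags, window, checksum, urgent) = struct.unpack("!HHIIBBHHH", tcp)

FLAG_NAMES = {0x01: "FIN", 0x02: "SYN", 0x04: "RST",
              0x08: "PSH", 0x10: "ACK", 0x20: "URG"}
set_flags = [name for bit, name in FLAG_NAMES.items() if flags & bit]

print("Source port :", src_port)              # 50014
print("Dest port   :", dst_port)              # 443
print("Sequence    :", hex(seq))              # 0x6b8b9c41
print("Data offset :", offset_reserved >> 4, "x 4 bytes")
print("Flags       :", ", ".join(set_flags))  # SYN
print("Window size :", window)                # 8192
```
Note the network byte order (`!`) in the format string: unlike the NTFS and registry examples, TCP header fields are big-endian.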
---
### IT Forensics - Timeline & Log Analysis
#### Test: forensics_timeline_01 - Event Reconstruction
**Purpose**: Correlate logs to identify attack patterns
**Timeline Analysis Skills**:
1. Chronological ordering
2. Event correlation across sources
3. Anomaly identification
4. Attack pattern recognition
5. Impact assessment
**Example Scenario**:
```bash
14:23:15 - Admin login from 10.0.0.5 ✓ Normal
14:23:47 - Access /etc/passwd ⚠️ Suspicious (enumeration)
14:24:12 - Write shell.php to web dir 🚨 Malicious (web shell)
14:24:45 - Netcat listener on 4444 🚨 Malicious (backdoor)
14:25:01 - External connection 🚨 Compromise (C2 callback)
14:26:33 - Admin logout
14:30:00 - Failed login from external 🚨 Lateral movement attempt
```
**Attack Pattern**: Web application compromise → web shell upload → reverse shell → persistence → lateral movement
**Scoring Criteria**:
- 5 points: Complete attack narrative with IOCs and recommendations
- 3-4 points: Identifies compromise, basic timeline
- 1-2 points: Recognizes suspicious activity, incomplete analysis
- 0 points: Cannot identify attack pattern
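Automated tooling can only pre-sort and flag events; the narrative itself is analyst work. A toy Python sketch over the example log lines (the keyword list is ad hoc and purely illustrative):
```python
from datetime import datetime

LOG = """\
2024-01-15 14:23:15 | User 'admin' login successful from 10.0.0.5
2024-01-15 14:23:47 | File access: /etc/passwd (read) by 'admin'
2024-01-15 14:24:12 | File access: /var/www/upload/shell.php (write) by 'admin'
2024-01-15 14:24:45 | New process: nc -l -p 4444 by 'admin'
2024-01-15 14:25:01 | Network connection: 10.0.0.5:4444 <- 203.0.113.50:52341
2024-01-15 14:26:33 | User 'admin' logout
2024-01-15 14:30:00 | Login attempt 'admin' from 203.0.113.50 FAILED"""

# Crude keyword heuristics; real triage relies on analyst judgment and threat intel
SUSPICIOUS = ("/etc/passwd", "shell.php", "nc -l", "203.0.113.50", "FAILED")

events = []
for line in LOG.splitlines():
    stamp, _, message = line.partition(" | ")
    events.append((datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message))

for when, message in sorted(events):
    flag = "SUSPICIOUS" if any(k in message for k in SUSPICIOUS) else "normal    "
    print(when.time(), flag, message)
```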
---
## 🎯 Multi-Turn Conversation Tests
### Test: multiturn_01 - Progressive Hex Analysis
**Purpose**: Maintain context across multiple exchanges while building understanding
**Turn 1**: File type identification from initial bytes
**Turn 2**: Structure parsing with offset references
**Turn 3**: Next steps and deeper analysis
**Key Evaluation Points**:
- Remembers initial findings
- Builds on previous responses
- Shows progressive understanding
- Maintains technical accuracy
---
### Test: multiturn_02 - Forensic Investigation Scenario
**Purpose**: Simulate real investigation workflow
**Stages**:
1. Initial triage (data source identification)
2. Evidence correlation (connecting artifacts)
3. Impact assessment (IOC identification, response planning)
**Scoring Focus**:
- Logical investigation flow
- Context retention across turns
- Practical recommendations
- Complete picture integration
---
### Test: multiturn_03 - Technical Depth Building
**Purpose**: Progress from concept to implementation
**Progression**:
1. Concept explanation (NTFS ADS)
2. Practical application (attack scenarios)
3. Hands-on implementation (PowerShell commands)
**Expected Depth**:
- Turn 1: Clear conceptual understanding
- Turn 2: Builds on concept with examples
- Turn 3: Demonstrates practical application
---
## 📊 Evaluation Guidelines
### Little-Endian Conversions
**Always verify**:
- Byte order reversal shown
- Decimal conversion provided
- Offset references included
**Example**:
```bash
Bytes at offset 0x10: 42 00
Little-endian: 0x0042 = 66 decimal
```
### Hex to ASCII
Common conversions to know:
- 0x20-0x7E: Printable ASCII
- 46 49 4C 45 = "FILE"
- 50 4B = "PK"
- 4D 5A = "MZ"
### Forensic Significance
Always ask:
- What does this artifact tell us?
- How can it be used in investigation?
- What are the limitations?
- What other data sources confirm/refute this?
---
## 🎓 Recommended Resources
For deeper understanding:
- NTFS Documentation (Microsoft)
- RFC 793 (TCP)
- File Signatures Database (Gary Kessler)
- Windows Registry Forensics (Harlan Carvey)
- The Art of Memory Forensics (Ligh, Case, Levy, Walters)
---
## ⚖️ Scoring Summary
**Exceptional (4-5)**:
- Accurate hex interpretation
- Correct endianness handling
- Forensic context provided
- Clear explanations
**Pass (2-3)**:
- Basic accuracy
- Some interpretation errors
- Limited context
- Incomplete explanations
**Fail (0-1)**:
- Major misinterpretations
- No endianness consideration
- Missing forensic value
- Incoherent explanations

test_suite.yaml (new file, 522 lines)

@@ -0,0 +1,522 @@
# AI Model Evaluation Test Suite
# Focus: General reasoning + IT Forensics (Academic)
metadata:
version: "1.0"
author: "AI Evaluation Framework"
focus_areas:
- Logic & Reasoning
- Mathematics & Calculation
- Instruction Following
- Creative Writing
- Code Generation
- Language Nuance
- IT Forensics
- Multi-turn Conversations
# Scoring rubric for all tests
scoring_rubric:
fail:
score: 0-1
description: "Major errors, fails to meet basic requirements"
pass:
score: 2-3
description: "Meets requirements with minor issues"
exceptional:
score: 4-5
description: "Exceeds requirements, demonstrates deep understanding"
# Individual test categories
test_categories:
# ========== GENERAL REASONING TESTS ==========
- category: "Logic & Reasoning"
tests:
- id: "logic_01"
name: "Family Logic Puzzle"
type: "single_turn"
prompt: "Three doctors said that Bill is their brother. Bill says he has no brothers. How many brothers does Bill actually have?"
evaluation_criteria:
- "Correctly identifies Bill is a woman/sister"
- "Answers: 0 brothers"
- "Explains the logical deduction"
expected_difficulty: "medium"
- id: "logic_02"
name: "Temporal Reasoning"
type: "single_turn"
prompt: "If it was two hours ago, it would have been as long after 1:00 PM as it was before 1:00 PM today. What time is it now? Explain your deduction step-by-step."
evaluation_criteria:
- "Shows algebraic setup: (t-2) - 13:00 = 13:00 - (t-2)"
- "Correct answer: 5:00 PM (17:00)"
- "Clear step-by-step reasoning"
expected_difficulty: "hard"
- category: "Mathematics & Calculation"
tests:
- id: "math_01"
name: "Average Speed with Stop"
type: "single_turn"
prompt: "If a train travels 240 miles in 3 hours, then stops for 45 minutes, then travels another 180 miles in 2 hours, what is the average speed for the entire journey including the stop?"
evaluation_criteria:
- "Total distance: 420 miles"
- "Total time: 5.75 hours"
- "Average speed: 73.04 mph (approximately)"
- "Shows calculation steps"
expected_difficulty: "medium"
- id: "math_02"
name: "Cross-System Fuel Calculation"
type: "single_turn"
prompt: "A vehicle consumes 8.5 liters of fuel for every 100 kilometers traveled. If the fuel tank holds 15 gallons, and the car has already traveled 120 miles starting from a full tank, how many kilometers of range are left? (Use: 1 gallon = 3.785 liters; 1 mile = 1.609 km)."
evaluation_criteria:
- "Correct unit conversions (gallons to liters, miles to km)"
- "Accurate fuel consumption calculation"
- "Remaining range calculation: approximately 570-580 km"
- "Shows intermediate steps"
expected_difficulty: "hard"
- category: "Instruction Following"
tests:
- id: "instr_01"
name: "Photosynthesis Constraints"
type: "single_turn"
prompt: "Write exactly 3 sentences about photosynthesis. The first sentence must be exactly 8 words long. The second must contain the word 'chlorophyll'. The third must end with a question mark."
evaluation_criteria:
- "Exactly 3 sentences"
- "First sentence exactly 8 words"
- "Second contains 'chlorophyll'"
- "Third ends with '?'"
- "Content is accurate about photosynthesis"
expected_difficulty: "medium"
- id: "instr_02"
name: "Quantum Entanglement Negative Constraints"
type: "single_turn"
prompt: "Summarize the concept of 'Quantum Entanglement' in exactly 4 sentences. 1) The first sentence must be exactly 12 words long. 2) You CANNOT use the words 'particle', 'physics', or 'Einstein' in any part of the response. 3) The third sentence must be a question. 4) The final word of the summary must be 'connected'."
evaluation_criteria:
- "Exactly 4 sentences"
- "First sentence exactly 12 words"
- "No forbidden words (particle, physics, Einstein)"
- "Third sentence is a question"
- "Ends with 'connected'"
expected_difficulty: "very_hard"
- category: "Creative Writing"
tests:
- id: "creative_01"
name: "Lighthouse Keeper Story"
type: "single_turn"
prompt: "Write a two-paragraph story about a lighthouse keeper who discovers something unusual. Use vivid sensory details."
evaluation_criteria:
- "Exactly 2 paragraphs"
- "Vivid sensory details (sight, sound, smell, touch, taste)"
- "Coherent narrative"
- "Creative and engaging"
expected_difficulty: "medium"
- id: "creative_02"
name: "Victorian Greenhouse with Constraints"
type: "single_turn"
prompt: "Write a two-paragraph scene of a person entering an abandoned Victorian greenhouse in the middle of a blizzard. Use the 'Show, Don't Tell' technique. You must include at least one metaphor involving glass and one simile involving ghosts. Do not use the words 'cold', 'scary', or 'old'."
evaluation_criteria:
- "Two paragraphs"
- "Shows rather than tells"
- "Contains glass metaphor"
- "Contains ghost simile"
- "No forbidden words (cold, scary, old)"
- "Atmospheric and evocative"
expected_difficulty: "hard"
- category: "Code Generation"
tests:
- id: "code_01"
name: "Duplicate Filter Function"
type: "single_turn"
prompt: "Write a Python function that takes a list of integers and returns a new list containing only the numbers that appear exactly twice in the original list. Include example usage."
evaluation_criteria:
- "Syntactically correct Python"
- "Correctly identifies duplicates appearing exactly twice"
- "Includes example usage"
- "Handles edge cases"
expected_difficulty: "medium"
- id: "code_02"
name: "Weight Converter with Error Handling"
type: "single_turn"
prompt: "Write a Python function `process_measurements` that takes a list of strings representing weights (e.g., '5kg', '12lb', '300g'). The function should convert all weights to grams, filter out any values that exceed 5 kilograms, and return the average of the remaining values. Include try-except blocks for malformed strings and provide three test cases: one with metric, one with imperial, and one with a 'corrupted' string."
evaluation_criteria:
- "Correct parsing of weight strings"
- "Accurate unit conversions (kg, lb, g to grams)"
- "Proper filtering (> 5kg excluded)"
- "Robust error handling"
- "Three distinct test cases provided"
expected_difficulty: "hard"
- category: "Language Nuance"
tests:
- id: "nuance_01"
name: "Emphasis Shift Analysis"
type: "single_turn"
prompt: "Explain the difference in meaning when different words are emphasized in this sentence: 'I didn't say she stole the money'. Show how the meaning changes with emphasis on each word."
evaluation_criteria:
- "Explains emphasis on 'I' (someone else said it)"
- "Explains emphasis on 'didn't' (denial)"
- "Explains emphasis on 'say' (implied it)"
- "Explains emphasis on 'she' (someone else did)"
- "Explains emphasis on 'stole' (obtained differently)"
- "Explains emphasis on 'money' (took something else)"
expected_difficulty: "medium"
- id: "nuance_02"
name: "Professional Apology Analysis"
type: "single_turn"
prompt: "Compare the social implications and 'hidden' meanings of these three phrases when used in a professional workplace setting after a mistake: 1) 'I'm sorry if you feel that way.' 2) 'I apologize for the oversight.' 3) 'Mistakes were made on my end.' Explain which one is most likely to preserve professional authority vs. which one sounds like 'gaslighting'."
evaluation_criteria:
- "Identifies phrase 1 as potentially gaslighting"
- "Recognizes phrase 2 as genuine accountability"
- "Analyzes phrase 3 for passive voice implications"
- "Discusses power dynamics and authority"
- "Demonstrates understanding of pragmatics"
expected_difficulty: "hard"
- category: "Problem Solving & Logistics"
tests:
- id: "logistics_01"
name: "Water Jug Problem"
type: "single_turn"
prompt: "You have a 3-gallon jug and a 5-gallon jug. How can you measure exactly 4 gallons of water? Explain each step."
evaluation_criteria:
- "Provides step-by-step solution"
- "Reaches exactly 4 gallons"
- "Logical sequence of pours"
- "Clear explanation"
expected_difficulty: "medium"
- id: "logistics_02"
name: "Bridge Transport Optimization"
type: "single_turn"
prompt: "You need to transport 500 kilograms (approx. 1,102 lbs) of equipment across a bridge. The bridge has a strict limit of 150 kg per trip. You have three crates weighing 70 kg, 80 kg, and 120 kg, plus a variety of smaller 10 kg weights. However, the transport cart itself weighs 25 lbs. Calculate the minimum number of trips required and provide a loading manifest for each trip in both kilograms and pounds."
evaluation_criteria:
- "Converts cart weight to kg (≈11.34 kg)"
- "Accounts for cart weight in each trip"
- "Calculates effective capacity per trip"
- "Provides minimum number of trips"
- "Shows manifest in both kg and lbs"
- "Reaches exactly 500 kg total"
expected_difficulty: "very_hard"
# ========== IT FORENSICS TESTS ==========
- category: "IT Forensics - File Systems"
tests:
- id: "forensics_mft_01"
name: "MFT Entry Analysis - Basic"
type: "single_turn"
prompt: |
Analyze this hex dump from an NTFS Master File Table (MFT) entry and answer:
1) What is the signature of this MFT entry?
2) Is this entry in use or deleted?
3) What is the sequence number?
Hex dump (first 48 bytes of MFT entry):
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 46 49 4C 45 30 00 03 00 95 1F 23 00 00 00 00 00
00000010 01 00 01 00 38 00 01 00 A0 01 00 00 00 04 00 00
00000020 00 00 00 00 00 00 00 00 06 00 00 00 00 00 00 00
evaluation_criteria:
- "Identifies signature as 'FILE' (46 49 4C 45)"
- "Recognizes entry is in use (based on flags at offset 0x16)"
- "Correctly reads sequence number from offset 0x10"
- "Shows understanding of little-endian byte order"
- "Explains reasoning with offset references"
expected_difficulty: "hard"
- id: "forensics_mft_02"
name: "MFT Entry Analysis - Advanced"
type: "single_turn"
prompt: |
Analyze this complete MFT entry header and extract key metadata:
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 46 49 4C 45 30 00 03 00 EA 3F 00 00 00 00 00 00
00000010 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
00000020 00 00 00 00 00 00 00 00 04 00 00 00 05 00 00 00
00000030 2A 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
Questions:
1) What is the update sequence array offset?
2) What is the update sequence array size?
3) What is the $LogFile sequence number (LSN)?
4) What is the offset to the first attribute?
5) What are the MFT entry flags (in use/directory)?
evaluation_criteria:
- "Identifies USA offset (0x0030 at offset 0x04-0x05)"
- "Identifies USA size (0x0003 at offset 0x06-0x07)"
- "Reads LSN correctly (0x00003FEA, little-endian)"
- "Identifies first attribute offset (0x0038 at offset 0x14-0x15)"
- "Interprets flags correctly (offset 0x16-0x17)"
- "Demonstrates understanding of MFT structure"
expected_difficulty: "very_hard"
- id: "forensics_signature_01"
name: "File Signature Identification"
type: "single_turn"
prompt: |
Identify the file types from these hex signatures and explain your reasoning:
A) FF D8 FF E0 00 10 4A 46 49 46
B) 50 4B 03 04 14 00 06 00
C) 89 50 4E 47 0D 0A 1A 0A
D) 25 50 44 46 2D 31 2E 34
E) 52 61 72 21 1A 07 00
evaluation_criteria:
- "Correctly identifies A as JPEG (FF D8 FF + JFIF)"
- "Identifies B as ZIP/PKZip (PK headers)"
- "Identifies C as PNG (\\x89PNG)"
- "Identifies D as PDF (%PDF-1.4)"
- "Identifies E as RAR archive"
- "Explains significance of magic numbers"
expected_difficulty: "medium"
- category: "IT Forensics - Registry & Artifacts"
tests:
- id: "forensics_registry_01"
name: "Windows Registry Hive Header"
type: "single_turn"
prompt: |
Analyze this Windows Registry hive header:
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 72 65 67 66 E6 07 00 00 E6 07 00 00 00 00 00 00
00000010 01 00 00 00 03 00 00 00 00 00 00 00 01 00 00 00
Questions:
1) What is the registry hive signature?
2) What are the primary and secondary sequence numbers?
3) What is the hive format version?
evaluation_criteria:
- "Identifies 'regf' signature (72 65 67 66)"
- "Reads primary sequence number (0x000007E6 = 2022)"
- "Reads secondary sequence number (same)"
- "Identifies format version or major version number"
- "Demonstrates knowledge of registry forensics"
expected_difficulty: "hard"
- id: "forensics_timestamp_01"
name: "FILETIME Conversion"
type: "single_turn"
prompt: |
Convert these Windows FILETIME values to human-readable UTC timestamps:
A) 01 D8 93 4B 7C F3 D9 01 (little-endian 64-bit value)
B) 00 80 3E D5 DE B1 9D 01
Explain your conversion methodology. (FILETIME = 100-nanosecond intervals since Jan 1, 1601 UTC)
evaluation_criteria:
- "Correctly reverses byte order (little-endian)"
- "Converts to decimal"
- "Applies FILETIME epoch (Jan 1, 1601)"
- "Provides reasonable timestamp or shows calculation method"
- "Explains conversion steps"
expected_difficulty: "very_hard"
- category: "IT Forensics - Memory & Network"
tests:
- id: "forensics_memory_01"
name: "Memory Artifact Identification"
type: "single_turn"
prompt: |
You find this ASCII string in a memory dump at offset 0x1A4F3000:
GET /admin/login.php HTTP/1.1
Host: 192.168.1.100
User-Agent: Mozilla/5.0
Cookie: PHPSESSID=a3f7d8bc9e2a1d5c
What artifacts can you extract and what do they tell you forensically?
evaluation_criteria:
- "Identifies HTTP GET request"
- "Extracts target URL/path (/admin/login.php)"
- "Identifies target host IP"
- "Recognizes session cookie (PHPSESSID)"
- "Discusses forensic significance (web access, authentication attempt)"
- "Mentions potential for timeline reconstruction"
expected_difficulty: "medium"
- id: "forensics_network_01"
name: "TCP Header Analysis"
type: "single_turn"
prompt: |
Analyze this TCP header (first 20 bytes):
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 C3 5E 01 BB 6B 8B 9C 41 00 00 00 00 50 02 20 00
00000010 E6 A1 00 00
Extract:
1) Source port
2) Destination port
3) Sequence number
4) TCP flags (which flags are set?)
5) Window size
evaluation_criteria:
- "Source port: 0xC35E = 50014"
- "Dest port: 0x01BB = 443 (HTTPS)"
- "Sequence: 0x6B8B9C41"
- "Flags: SYN flag set (0x02 in flags byte)"
- "Window: 0x2000 = 8192"
- "Shows understanding of TCP header structure"
expected_difficulty: "hard"
- category: "IT Forensics - Timeline & Log Analysis"
tests:
- id: "forensics_timeline_01"
name: "Event Reconstruction"
type: "single_turn"
prompt: |
Given these log entries, reconstruct the sequence of events and identify any anomalies:
2024-01-15 14:23:15 | User 'admin' login successful from 10.0.0.5
2024-01-15 14:23:47 | File access: /etc/passwd (read) by 'admin'
2024-01-15 14:24:12 | File access: /var/www/upload/shell.php (write) by 'admin'
2024-01-15 14:24:45 | New process: nc -l -p 4444 by 'admin'
2024-01-15 14:25:01 | Network connection: 10.0.0.5:4444 <- 203.0.113.50:52341
2024-01-15 14:26:33 | User 'admin' logout
2024-01-15 14:30:00 | Login attempt 'admin' from 203.0.113.50 FAILED
What likely occurred here from a forensic perspective?
evaluation_criteria:
- "Identifies initial legitimate admin login"
- "Recognizes suspicious file access pattern"
- "Identifies web shell upload (shell.php)"
- "Recognizes netcat listener setup"
- "Identifies reverse shell connection"
- "Notes external IP attempting access"
- "Constructs coherent attack narrative"
- "Identifies this as potential compromise scenario"
expected_difficulty: "hard"
# ========== MULTI-TURN CONVERSATION TESTS ==========
- category: "Multi-turn: Context Retention"
tests:
- id: "multiturn_01"
name: "Progressive Hex Analysis"
type: "multi_turn"
turns:
- turn: 1
prompt: "I'm going to show you a hex dump in parts. First, here's the beginning of a file:\n\n4D 5A 90 00 03 00 00 00\n\nWhat type of file does this appear to be?"
evaluation_criteria:
- "Identifies MZ header (DOS/Windows executable)"
- turn: 2
prompt: "Here's more data from offset 0x3C:\n\n00 00 00 00 80 00 00 00\n\nAnd at that offset (0x80) I find: 50 45 00 00\n\nWhat does this tell you about the file structure?"
evaluation_criteria:
- "Recognizes PE header offset pointer at 0x3C"
- "Identifies PE00 signature"
- "Concludes this is a Windows PE executable"
- "References information from Turn 1"
- turn: 3
prompt: "If I wanted to examine the import table of this PE file, what structure should I look for next, and where is it typically located?"
evaluation_criteria:
- "Mentions Import Directory in Data Directory"
- "References PE Optional Header"
- "Shows understanding of PE structure from previous turns"
- "Maintains context across all three turns"
expected_difficulty: "hard"
- id: "multiturn_02"
name: "Forensic Investigation Scenario"
type: "multi_turn"
turns:
- turn: 1
prompt: "You're investigating a security incident. Initial triage shows unusual outbound traffic on port 443 at 03:42 AM from workstation WS-2471. What data sources should you examine first and why?"
evaluation_criteria:
- "Mentions network logs/PCAP"
- "Suggests endpoint logs"
- "References firewall/proxy logs"
- "Mentions timeline context (unusual hour)"
- turn: 2
prompt: "Good. The firewall logs show the connection went to IP 198.51.100.47. The user 'jsmith' was logged in. DNS logs show this IP was queried as 'update-server.example.com' just before the connection. What's your next step?"
evaluation_criteria:
- "Suggests checking if domain is legitimate"
- "Recommends threat intelligence lookup"
- "Proposes examining what data was transferred"
- "Mentions checking user account activity"
- "References information from Turn 1"
- turn: 3
prompt: "Threat intel shows 198.51.100.47 is a known C2 server. The SSL cert on 443 is self-signed. You find a scheduled task created at 03:40 AM that runs 'C:\\Windows\\Temp\\svchost.exe'. Now what?"
evaluation_criteria:
- "Identifies indicators of compromise (C2, self-signed cert)"
- "Recognizes suspicious scheduled task"
- "Notes timing correlation (task before connection)"
- "Recommends containment steps"
- "Suggests collecting the malicious executable"
- "Integrates all context from previous turns"
- "Proposes comprehensive response plan"
expected_difficulty: "very_hard"
- id: "multiturn_03"
name: "Technical Depth Building"
type: "multi_turn"
turns:
- turn: 1
prompt: "Explain what NTFS Alternate Data Streams (ADS) are in 2-3 sentences."
evaluation_criteria:
- "Mentions file system feature of NTFS"
- "Explains multiple data streams per file"
- "Notes potential for hiding data"
- turn: 2
prompt: "How would an attacker exploit ADS, and how would you detect it during forensics?"
evaluation_criteria:
- "Describes hiding malware/data in ADS"
- "Mentions Zone.Identifier stream"
- "Explains dir /r command or forensic tools"
- "Builds on ADS concept from Turn 1"
- turn: 3
prompt: "If you found a file 'document.txt:hidden:$DATA' in an investigation, write a PowerShell one-liner to extract its contents."
evaluation_criteria:
- "Uses Get-Content with -Stream parameter"
- "Correctly references the stream name"
- "Syntax is approximately correct"
- "Demonstrates progression from concept to practice"
expected_difficulty: "medium"
- category: "Multi-turn: Instruction Following"
tests:
- id: "multiturn_instr_01"
name: "Accumulating Constraints"
type: "multi_turn"
turns:
- turn: 1
prompt: "Write a sentence about cybersecurity that contains exactly 10 words."
evaluation_criteria:
- "Exactly 10 words"
- "Related to cybersecurity"
- turn: 2
prompt: "Good. Now write another sentence about cybersecurity with exactly 10 words, but this one must also include the word 'encryption'."
evaluation_criteria:
- "Exactly 10 words"
- "Contains 'encryption'"
- "About cybersecurity"
- "Different from Turn 1"
- turn: 3
prompt: "Perfect. Now write a third sentence: 10 words, about cybersecurity, must include 'encryption', and must end with a question mark."
evaluation_criteria:
- "Exactly 10 words"
- "Contains 'encryption'"
- "About cybersecurity"
- "Ends with '?'"
- "Different from previous sentences"
- "Maintains all constraints from previous turns"
expected_difficulty: "medium"