improvements

2026-01-16 12:48:56 +01:00
parent 514bd9b571
commit 345aa419c7
9 changed files with 3966 additions and 204 deletions

.env.example (new file, 52 lines)

@@ -0,0 +1,52 @@
# AI Model Evaluation Configuration
# Copy this file to .env and fill in your values
# =============================================================================
# MODEL UNDER TEST (MUT) - The model being evaluated
# =============================================================================
# OpenAI-compatible API endpoint for the model under test
MUT_ENDPOINT=http://localhost:11434
# API key for the model under test (optional for local endpoints like Ollama)
MUT_API_KEY=
# Model name/identifier to test
# Supports multiple models separated by commas for batch testing:
# MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Or specify a single model:
MUT_MODEL=qwen3:4b-q4_K_M
# =============================================================================
# EVALUATOR API - Used for non-interactive mode to automatically score responses
# =============================================================================
# OpenAI-compatible API endpoint for the evaluator model
EVALUATOR_ENDPOINT=http://localhost:11434
# API key for the evaluator API
EVALUATOR_API_KEY=
# Evaluator model name (should be a capable model for evaluation tasks)
EVALUATOR_MODEL=qwen3:14b
# Temperature for evaluator (lower = more consistent scoring)
EVALUATOR_TEMPERATURE=0.3
# =============================================================================
# TEST CONFIGURATION
# =============================================================================
# Path to test suite YAML file
TEST_SUITE=test_suite.yaml
# Output directory for results
OUTPUT_DIR=results
# Filter tests by category (optional, leave empty for all categories)
FILTER_CATEGORY=
# =============================================================================
# EXECUTION MODE
# =============================================================================
# Run in non-interactive mode (true/false)
# When true, uses EVALUATOR_* settings for automated scoring
# When false, prompts user for manual evaluation
NON_INTERACTIVE=false

.gitignore (vendored, 1 line changed)

@@ -174,3 +174,4 @@ cython_debug/
# PyPI configuration file
.pypirc
results/

README.md (257 lines changed)

@@ -34,6 +34,11 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks
- Category-wise performance breakdown
- Difficulty-based analysis
- CSV export for further analysis
- **🌐 Interactive Web Dashboard** (New!)
- Visual analytics with charts and graphs
- Advanced intelligence metrics
- Filtering, sorting, and statistical analysis
- Multi-dimensional performance evaluation
## Quick Start
@@ -41,25 +46,82 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks
```bash
# Python 3.8+
pip install pyyaml requests
pip install -r requirements.txt
# or manually:
pip install pyyaml requests python-dotenv flask numpy
```
### Installation
```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
# Copy the example environment file
cp .env.example .env
# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```
### Configuration with .env File (Recommended)
The test suite can be configured using a `.env` file for easier batch testing and non-interactive mode:
```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY= # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M
# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY= # Optional
EVALUATOR_MODEL=qwen3:14b # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3 # Lower = more consistent scoring
# Execution Mode
NON_INTERACTIVE=false # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY= # Optional: filter by category
```
### Basic Usage
#### 1. Test a Single Model
#### 0. Test Connectivity (Dry Run)
Before running the full test suite, verify that your API endpoints are reachable and properly configured:
```bash
# For Ollama (default: http://localhost:11434)
# Test MUT endpoint connectivity
python ai_eval.py --dry-run
# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run
# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run
# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```
The dry-run mode will:
- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test evaluator endpoint if in non-interactive mode
- Exit with success/failure status (see the sketch below)
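A connectivity probe of this kind can be reproduced in a few lines of Python. The sketch below is illustrative only: the `/v1/chat/completions` path and the request shape are assumptions based on the OpenAI-compatible API, not the actual `ai_eval.py` code.
```python
# Rough sketch of a dry-run connectivity check (assumed behavior, not ai_eval.py itself).
import os
import sys

import requests


def check_endpoint(endpoint, api_key, model):
    """Send a one-token chat completion to verify reachability, auth, and model availability."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }
    try:
        resp = requests.post(f"{endpoint}/v1/chat/completions",
                             headers=headers, json=payload, timeout=15)
        resp.raise_for_status()
        print(f"OK: {model} reachable at {endpoint}")
        return True
    except requests.RequestException as exc:
        print(f"FAILED: {model} at {endpoint}: {exc}")
        return False


if __name__ == "__main__":
    endpoint = os.getenv("MUT_ENDPOINT", "http://localhost:11434")
    api_key = os.getenv("MUT_API_KEY", "")
    models = [m.strip() for m in os.getenv("MUT_MODEL", "qwen3:4b-q4_K_M").split(",")]
    results = [check_endpoint(endpoint, api_key, m) for m in models]
    sys.exit(0 if all(results) else 1)
```
The non-zero exit code on failure makes a check like this easy to use in shell scripts or CI.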
#### 1. Interactive Mode (Manual Evaluation)
```bash
# Using .env file
python ai_eval.py
# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
# For other endpoints with API key
@@ -69,33 +131,94 @@ python ai_eval.py \
--model your-model-name
```
#### 2. Test Multiple Models (Quantization Comparison)
#### 2. Non-Interactive Mode (Automated Evaluation)
Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.
```bash
# Test different quantizations of qwen3:4b
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b
# Test different model sizes
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M
# Run the test
python ai_eval.py
# Or with command-line arguments
python ai_eval.py \
--endpoint http://localhost:11434 \
--model qwen3:4b-q4_K_M \
--non-interactive \
--evaluator-endpoint http://localhost:11434 \
--evaluator-model qwen3:14b
```
#### 3. Filter by Category
**How Non-Interactive Mode Works:**
- For each test, the script sends the original prompt, model response, and evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation (see the sketch below)
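As an illustration of that flow only, a scoring call against an OpenAI-compatible evaluator endpoint could be shaped roughly like this; the actual system prompt, request shape, and response parsing in `ai_eval.py` will differ.
```python
# Illustrative evaluator scoring call for non-interactive mode (assumed request/response shape).
import json
import os

import requests

EVAL_SYSTEM_PROMPT = (
    "You are an impartial evaluator. Score the response from 0 to 5 against the criteria. "
    'Reply only with JSON: {"score": <0-5>, "notes": "<short justification>"}'
)


def score_response(prompt, response, criteria):
    """Ask the evaluator model for a 0-5 score plus notes for one test."""
    payload = {
        "model": os.getenv("EVALUATOR_MODEL", "qwen3:14b"),
        "temperature": float(os.getenv("EVALUATOR_TEMPERATURE", "0.3")),
        "messages": [
            {"role": "system", "content": EVAL_SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Prompt:\n{prompt}\n\nModel response:\n{response}\n\n"
                "Evaluation criteria:\n- " + "\n- ".join(criteria)
            )},
        ],
    }
    endpoint = os.getenv("EVALUATOR_ENDPOINT", "http://localhost:11434")
    r = requests.post(f"{endpoint}/v1/chat/completions", json=payload, timeout=120)
    r.raise_for_status()
    # Assumes the evaluator returns clean JSON; production code needs more defensive parsing.
    return json.loads(r.json()["choices"][0]["message"]["content"])
```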
**Choosing an Evaluator Model:**
- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring
#### 3. Test Multiple Models (Batch Mode)
Test multiple models in one run by specifying comma-separated model names:
```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Run batch test
python ai_eval.py
# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```
The script will automatically test each model sequentially and save individual results.
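Conceptually, batch mode is a loop over the comma-separated `MUT_MODEL` list. The sketch below shows that idea; the `run_single_model` callable is a placeholder, and the result-file naming mirrors the `results/` filenames shown further down in the project structure, so both are assumptions rather than the real implementation.
```python
# Sketch of the batch loop: one full evaluation pass per comma-separated model name.
import os


def run_batch(run_single_model):
    """run_single_model is a placeholder for whatever runs one test-suite pass for one model."""
    models = [m.strip() for m in os.getenv("MUT_MODEL", "").split(",") if m.strip()]
    output_dir = os.getenv("OUTPUT_DIR", "results")
    os.makedirs(output_dir, exist_ok=True)
    for model in models:
        safe_name = model.replace(":", "_").replace("/", "_")  # qwen3:4b-q4_K_M -> qwen3_4b-q4_K_M
        result_path = os.path.join(output_dir, f"{safe_name}_latest.json")
        print(f"Testing {model} -> {result_path}")
        run_single_model(model, result_path)
```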
#### 4. Filter by Category
```bash
# Test only IT Forensics categories
python ai_eval.py \
--endpoint http://localhost:11434 \
--model qwen3:4b \
--category "IT Forensics - File Systems"
python ai_eval.py --category "IT Forensics - File Systems"
```
#### 4. Analyze Results
#### 5. Analyze Results
```bash
# Compare all tested models
## Analyzing Results
### Interactive Web Dashboard (Recommended)
Launch the comprehensive web interface for visual analysis:
```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web
# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```
**Features:**
- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation
See [WEB_INTERFACE.md](WEB_INTERFACE.md) for detailed documentation.
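Under the hood the dashboard is served by a small Flask app (Flask comes from `requirements.txt`), and the template fetches JSON from `/api/comparison`, `/api/statistics`, and `/api/intelligence_metrics`. The real routes and aggregation live in `analyze_results.py`; the following is only a structural sketch with a stubbed comparison payload.
```python
# Structural sketch of the --web backend (stubbed; the real aggregation is in analyze_results.py).
import glob
import json
import os

from flask import Flask, jsonify, render_template

app = Flask(__name__)


def load_results(results_dir="results"):
    """Load every result JSON in the output directory; the schema is defined by ai_eval.py."""
    data = {}
    for path in glob.glob(os.path.join(results_dir, "*.json")):
        with open(path, encoding="utf-8") as fh:
            data[os.path.basename(path)] = json.load(fh)
    return data


@app.route("/")
def dashboard():
    return render_template("dashboard.html")


@app.route("/api/comparison")
def comparison():
    # dashboard.html expects {"models": {...}, "categories": [...]}; this stub just echoes
    # the raw files instead of the real per-model aggregation.
    return jsonify({"models": load_results(), "categories": []})


if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8080)
```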
### Command-Line Analysis
```bash
# Compare all models
python analyze_results.py --compare
# Detailed report for specific model
@@ -188,6 +311,7 @@ All tests are evaluated on a 0-5 scale:
├── ai_eval.py # Main testing script
├── analyze_results.py # Results analysis and comparison
├── test_suite.yaml # Test definitions
├── .env.example # Configuration template
├── results/ # Auto-created results directory
│ ├── qwen3_4b-q4_K_M_latest.json
│ ├── qwen3_4b-q8_0_latest.json
@@ -195,6 +319,60 @@ All tests are evaluated on a 0-5 scale:
└── README.md
```
## Configuration Reference
### Environment Variables (.env file)
All configuration can be set via a `.env` file or command-line arguments; command-line arguments override `.env` values.
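The usual pattern behind that precedence looks roughly like the sketch below (illustrative only; the actual argument handling in `ai_eval.py` may differ).
```python
# Typical precedence pattern: command-line flag > .env value > built-in default.
import argparse
import os

from dotenv import load_dotenv

load_dotenv()  # loads .env into the environment without overwriting variables already set

parser = argparse.ArgumentParser()
parser.add_argument("--endpoint", default=os.getenv("MUT_ENDPOINT", "http://localhost:11434"))
parser.add_argument("--model", default=os.getenv("MUT_MODEL", "qwen3:4b-q4_K_M"))
parser.add_argument("--non-interactive", action="store_true",
                    default=os.getenv("NON_INTERACTIVE", "false").lower() == "true")
args = parser.parse_args()
print(args.endpoint, args.model, args.non_interactive)
```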
#### Model Under Test (MUT)
| Variable | Description | Example |
| --- | --- | --- |
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |
#### Evaluator Configuration (for Non-Interactive Mode)
| Variable | Description | Example |
| --- | --- | --- |
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |
#### Test Configuration
| Variable | Description | Example |
| --- | --- | --- |
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |
### Command-Line Arguments
All environment variables have corresponding command-line flags:
```bash
python ai_eval.py --help
Options:
--endpoint ENDPOINT Model under test endpoint
--api-key API_KEY Model under test API key
--model MODEL Model name to test
--test-suite FILE Test suite YAML file
--output-dir DIR Output directory
--category CATEGORY Filter by category
--non-interactive Enable automated evaluation
--evaluator-endpoint ENDPOINT Evaluator API endpoint
--evaluator-api-key KEY Evaluator API key
--evaluator-model MODEL Evaluator model name
--evaluator-temperature TEMP Evaluator temperature
```
## Advanced Usage
### Custom Test Suite
@@ -214,28 +392,25 @@ Edit `test_suite.yaml` to add your own tests:
expected_difficulty: "medium" # medium, hard, very_hard
```
### Batch Testing Script
### Batch Testing Examples
Create `batch_test.sh`:
Testing multiple models using the `.env` configuration:
```bash
#!/bin/bash
# Configure .env with multiple models
cp .env.example .env
nano .env
ENDPOINT="http://localhost:11434"
# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Test all qwen3:4b quantizations
for quant in q4_K_M q8_0 fp16; do
echo "Testing qwen3:4b-${quant}..."
python ai_eval.py --endpoint $ENDPOINT --model "qwen3:4b-${quant}"
done
# Run batch tests
python ai_eval.py
# Test all sizes with q4_K_M
for size in 4b 8b 14b; do
echo "Testing qwen3:${size}-q4_K_M..."
python ai_eval.py --endpoint $ENDPOINT --model "qwen3:${size}-q4_K_M"
done
# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M
# Generate comparison
# Generate comparison after testing
python analyze_results.py --compare
```
@@ -244,8 +419,8 @@ python analyze_results.py --compare
For OpenAI-compatible cloud services:
```bash
python ai_eval.py \
--endpoint https://api.service.com \
--api-key your-api-key \
--model model-name
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```

File diff suppressed because it is too large.

File diff suppressed because it is too large.


@@ -1,85 +0,0 @@
#!/bin/bash
# Batch Test Script for AI Model Evaluation
# Tests multiple models and generates comparison report
# Configuration
ENDPOINT="${ENDPOINT:-http://localhost:11434}"
API_KEY="${API_KEY:-}"
# Color output
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}AI Model Batch Testing${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
echo "Endpoint: $ENDPOINT"
echo "API Key: ${API_KEY:0:10}${API_KEY:+...}"
echo ""
# Function to run test
run_test() {
local model=$1
echo -e "${GREEN}Testing: $model${NC}"
if [ -z "$API_KEY" ]; then
python ai_eval.py --endpoint "$ENDPOINT" --model "$model"
else
python ai_eval.py --endpoint "$ENDPOINT" --api-key "$API_KEY" --model "$model"
fi
if [ $? -eq 0 ]; then
echo -e "${GREEN}✓ Completed: $model${NC}"
else
echo -e "${YELLOW}⚠ Failed or interrupted: $model${NC}"
fi
echo ""
}
# Test qwen3:4b models with different quantizations
echo -e "${BLUE}=== Testing qwen3:4b with different quantizations ===${NC}"
echo ""
models_4b=(
"qwen3:4b-q4_K_M"
"qwen3:4b-q8_0"
"qwen3:4b-fp16"
)
for model in "${models_4b[@]}"; do
run_test "$model"
done
# Test different model sizes with q4_K_M quantization
echo -e "${BLUE}=== Testing different model sizes (q4_K_M) ===${NC}"
echo ""
models_sizes=(
"qwen3:4b-q4_K_M"
"qwen3:8b-q4_K_M"
"qwen3:14b-q4_K_M"
)
for model in "${models_sizes[@]}"; do
run_test "$model"
done
# Generate comparison report
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}Generating Comparison Report${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
python analyze_results.py --compare
python analyze_results.py --export batch_comparison.csv
echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN}Batch Testing Complete!${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo "Results saved in ./results/"
echo "Comparison CSV: ./results/batch_comparison.csv"


@@ -1,2 +1,5 @@
pyyaml
requests
requests
python-dotenv
flask
numpy

templates/dashboard.html (new file, 977 lines)

@@ -0,0 +1,977 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LLM Evaluation Dashboard</title>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
:root {
--bg-gradient-start: #667eea;
--bg-gradient-end: #764ba2;
--card-bg: #ffffff;
--text-primary: #333333;
--text-secondary: #666666;
--border-color: #e0e0e0;
--stat-card-bg: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
--shadow: rgba(0,0,0,0.1);
--shadow-hover: rgba(0,0,0,0.15);
}
body.dark-mode {
--bg-gradient-start: #1a1a2e;
--bg-gradient-end: #16213e;
--card-bg: #0f1419;
--text-primary: #e0e0e0;
--text-secondary: #a0a0a0;
--border-color: #2a2a3e;
--stat-card-bg: linear-gradient(135deg, #1a1a2e 0%, #16213e 100%);
--shadow: rgba(0,0,0,0.3);
--shadow-hover: rgba(0,0,0,0.5);
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background: linear-gradient(135deg, var(--bg-gradient-start) 0%, var(--bg-gradient-end) 100%);
color: var(--text-primary);
min-height: 100vh;
padding: 20px;
transition: all 0.3s ease;
}
.container {
max-width: 1400px;
margin: 0 auto;
}
header {
background: var(--card-bg);
padding: 30px;
border-radius: 15px;
box-shadow: 0 10px 40px var(--shadow);
margin-bottom: 30px;
position: relative;
}
.theme-toggle {
position: absolute;
top: 30px;
right: 30px;
background: var(--border-color);
border: none;
padding: 10px 20px;
border-radius: 20px;
cursor: pointer;
font-size: 1em;
transition: all 0.3s;
}
.theme-toggle:hover {
transform: scale(1.05);
box-shadow: 0 4px 15px var(--shadow-hover);
}
h1 {
font-size: 2.5em;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
margin-bottom: 10px;
}
.subtitle {
color: var(--text-secondary);
font-size: 1.1em;
}
.tabs {
display: flex;
gap: 10px;
margin-bottom: 20px;
flex-wrap: wrap;
}
.tab {
background: var(--card-bg);
border: none;
padding: 12px 24px;
border-radius: 8px;
cursor: pointer;
font-size: 1em;
transition: all 0.3s;
box-shadow: 0 2px 10px var(--shadow);
color: var(--text-primary);
}
.tab:hover {
transform: translateY(-2px);
box-shadow: 0 4px 15px var(--shadow-hover);
}
.tab.active {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}
.content-panel {
display: none;
background: var(--card-bg);
padding: 30px;
border-radius: 15px;
box-shadow: 0 10px 40px var(--shadow);
animation: fadeIn 0.3s;
}
.content-panel.active {
display: block;
}
@keyframes fadeIn {
from { opacity: 0; transform: translateY(10px); }
to { opacity: 1; transform: translateY(0); }
}
.stats-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
gap: 20px;
margin-bottom: 30px;
}
.stat-card {
background: var(--stat-card-bg);
padding: 20px;
border-radius: 10px;
text-align: center;
}
.stat-card h3 {
font-size: 0.9em;
color: var(--text-secondary);
margin-bottom: 10px;
text-transform: uppercase;
}
.stat-card .value {
font-size: 2.5em;
font-weight: bold;
color: #667eea;
}
.chart-container {
position: relative;
height: 400px;
margin-bottom: 30px;
}
.controls {
display: flex;
gap: 15px;
margin-bottom: 20px;
flex-wrap: wrap;
}
select, input {
padding: 10px 15px;
border: 2px solid var(--border-color);
border-radius: 8px;
font-size: 1em;
background: var(--card-bg);
color: var(--text-primary);
cursor: pointer;
transition: border-color 0.3s;
}
select:hover, input:hover {
border-color: #667eea;
}
select:focus, input:focus {
outline: none;
border-color: #764ba2;
}
table {
width: 100%;
border-collapse: collapse;
margin-top: 20px;
}
th, td {
padding: 12px;
text-align: left;
border-bottom: 1px solid var(--border-color);
}
th {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
font-weight: 600;
cursor: pointer;
user-select: none;
}
th:hover {
opacity: 0.9;
}
tr:hover {
background: var(--border-color);
}
.score-badge {
display: inline-block;
padding: 5px 12px;
border-radius: 20px;
font-weight: bold;
font-size: 0.9em;
}
.score-exceptional {
background: #10b981;
color: white;
}
.score-pass {
background: #f59e0b;
color: white;
}
.score-fail {
background: #ef4444;
color: white;
}
.loading {
text-align: center;
padding: 40px;
color: var(--text-secondary);
}
.spinner {
border: 3px solid var(--border-color);
border-top: 3px solid #667eea;
border-radius: 50%;
width: 40px;
height: 40px;
animation: spin 1s linear infinite;
margin: 20px auto;
}
@keyframes spin {
0% { transform: rotate(0deg); }
100% { transform: rotate(360deg); }
}
.model-selector {
display: flex;
gap: 10px;
flex-wrap: wrap;
margin-bottom: 20px;
}
.model-chip {
padding: 8px 16px;
border-radius: 20px;
border: 2px solid #667eea;
background: var(--card-bg);
color: var(--text-primary);
cursor: pointer;
transition: all 0.3s;
}
.model-chip:hover {
background: #667eea;
color: white;
}
.model-chip.selected {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}
.metric-card {
background: var(--card-bg);
border: 2px solid var(--border-color);
border-radius: 10px;
padding: 20px;
margin-bottom: 20px;
}
.metric-card h3 {
color: #667eea;
margin-bottom: 15px;
}
.progress-bar {
background: var(--border-color);
height: 30px;
border-radius: 15px;
overflow: hidden;
margin: 10px 0;
position: relative;
cursor: help;
}
.progress-fill {
height: 100%;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
transition: width 0.5s;
display: flex;
align-items: center;
justify-content: flex-end;
padding-right: 10px;
color: white;
font-weight: bold;
}
/* Tooltip styles */
.tooltip {
position: relative;
display: inline-block;
}
.tooltip .tooltiptext {
visibility: hidden;
width: 300px;
background-color: rgba(0, 0, 0, 0.9);
color: #fff;
text-align: left;
border-radius: 8px;
padding: 12px;
position: absolute;
z-index: 1000;
bottom: 125%;
left: 50%;
margin-left: -150px;
opacity: 0;
transition: opacity 0.3s;
font-size: 0.85em;
line-height: 1.4;
box-shadow: 0 4px 20px rgba(0,0,0,0.3);
}
.tooltip .tooltiptext::after {
content: "";
position: absolute;
top: 100%;
left: 50%;
margin-left: -5px;
border-width: 5px;
border-style: solid;
border-color: rgba(0, 0, 0, 0.9) transparent transparent transparent;
}
.tooltip:hover .tooltiptext {
visibility: visible;
opacity: 1;
}
.tooltiptext code {
background: rgba(255, 255, 255, 0.1);
padding: 2px 6px;
border-radius: 3px;
font-family: monospace;
font-size: 0.9em;
}
.tooltiptext strong {
color: #667eea;
}
</style>
</head>
<body>
<div class="container">
<header>
<button class="theme-toggle" onclick="toggleTheme()">🌓 Toggle Dark Mode</button>
<h1>🧠 LLM Evaluation Dashboard</h1>
<p class="subtitle">Comprehensive Intelligence & Performance Analysis</p>
</header>
<div class="tabs">
<button class="tab active" onclick="switchTab('overview')">📊 Overview</button>
<button class="tab" onclick="switchTab('comparison')">⚔️ Model Comparison</button>
<button class="tab" onclick="switchTab('intelligence')">🎯 Intelligence Metrics</button>
<button class="tab" onclick="switchTab('categories')">📂 Category Analysis</button>
<button class="tab" onclick="switchTab('details')">🔍 Detailed Results</button>
</div>
<div id="overview" class="content-panel active">
<h2>System Overview</h2>
<div class="stats-grid" id="overviewStats">
<div class="loading">
<div class="spinner"></div>
Loading data...
</div>
</div>
<div class="chart-container">
<canvas id="overviewChart"></canvas>
</div>
</div>
<div id="comparison" class="content-panel">
<h2>Model Performance Comparison</h2>
<div class="controls">
<select id="metricSelect" onchange="updateComparisonChart()">
<option value="average">Average Score</option>
<option value="pass_rate">Pass Rate</option>
<option value="exceptional_rate">Exceptional Rate</option>
<option value="consistency">Consistency</option>
<option value="robustness">Robustness</option>
</select>
</div>
<div class="chart-container">
<canvas id="comparisonChart"></canvas>
</div>
</div>
<div id="intelligence" class="content-panel">
<h2>Intelligence Metrics Analysis</h2>
<p style="margin-bottom: 20px; color: #666;">
Advanced metrics evaluating different dimensions of AI intelligence and reasoning capabilities.
</p>
<div id="intelligenceMetrics">
<div class="loading">
<div class="spinner"></div>
Calculating intelligence metrics...
</div>
</div>
</div>
<div id="categories" class="content-panel">
<h2>Performance by Category</h2>
<div class="controls">
<select id="categorySelect" onchange="updateCategoryChart()">
<option value="">Loading categories...</option>
</select>
</div>
<div class="chart-container">
<canvas id="categoryChart"></canvas>
</div>
</div>
<div id="details" class="content-panel">
<h2>Detailed Test Results</h2>
<div class="controls">
<select id="modelSelect" onchange="loadModelDetails()">
<option value="">Select a model...</option>
</select>
<input type="text" id="searchInput" placeholder="Search tests..." onkeyup="filterTable()">
<select id="filterCategory" onchange="filterTable()">
<option value="">All Categories</option>
</select>
<select id="filterScore" onchange="filterTable()">
<option value="">All Scores</option>
<option value="exceptional">Exceptional (4-5)</option>
<option value="pass">Pass (2-3)</option>
<option value="fail">Fail (0-1)</option>
</select>
</div>
<div id="detailsTable">
<p class="loading">Select a model to view detailed results</p>
</div>
</div>
</div>
<script>
let comparisonData = null;
let statisticsData = null;
let intelligenceData = null;
let currentModelDetails = null;
// Theme toggle functionality
function toggleTheme() {
document.body.classList.toggle('dark-mode');
const isDark = document.body.classList.contains('dark-mode');
localStorage.setItem('darkMode', isDark ? 'enabled' : 'disabled');
}
// Load theme preference
function loadThemePreference() {
const darkMode = localStorage.getItem('darkMode');
if (darkMode === 'enabled') {
document.body.classList.add('dark-mode');
}
}
// Tab switching
function switchTab(tabName) {
document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
document.querySelectorAll('.content-panel').forEach(p => p.classList.remove('active'));
event.target.classList.add('active');
document.getElementById(tabName).classList.add('active');
}
// Initialize dashboard
async function initDashboard() {
loadThemePreference();
await loadOverview();
await loadComparison();
await loadStatistics();
await loadIntelligenceMetrics();
populateModelSelector();
}
async function loadOverview() {
try {
const response = await axios.get('/api/comparison');
comparisonData = response.data;
const models = Object.keys(comparisonData.models);
const totalTests = models.reduce((sum, model) =>
sum + comparisonData.models[model].metadata.total_tests, 0);
const avgScore = models.reduce((sum, model) =>
sum + (comparisonData.models[model].overall_stats.average || 0), 0) / models.length;
const statsHtml = `
<div class="stat-card">
<h3>Models Evaluated</h3>
<div class="value">${models.length}</div>
</div>
<div class="stat-card">
<h3>Total Tests</h3>
<div class="value">${totalTests}</div>
</div>
<div class="stat-card">
<h3>Average Score</h3>
<div class="value">${avgScore.toFixed(2)}</div>
</div>
<div class="stat-card">
<h3>Categories</h3>
<div class="value">${comparisonData.categories.length}</div>
</div>
`;
document.getElementById('overviewStats').innerHTML = statsHtml;
// Create overview chart
const ctx = document.getElementById('overviewChart').getContext('2d');
new Chart(ctx, {
type: 'bar',
data: {
labels: models,
datasets: [{
label: 'Average Score',
data: models.map(m => comparisonData.models[m].overall_stats.average || 0),
backgroundColor: 'rgba(102, 126, 234, 0.6)',
borderColor: 'rgba(102, 126, 234, 1)',
borderWidth: 2
}]
},
options: {
responsive: true,
maintainAspectRatio: false,
scales: {
y: {
beginAtZero: true,
max: 5
}
}
}
});
} catch (error) {
console.error('Error loading overview:', error);
}
}
async function loadComparison() {
updateComparisonChart();
}
async function updateComparisonChart() {
if (!comparisonData) return;
const metric = document.getElementById('metricSelect').value;
const models = Object.keys(comparisonData.models);
let data, label;
if (metric === 'consistency' || metric === 'robustness') {
if (!statisticsData) {
await loadStatistics();
}
// Align each model with its position in the statistics arrays (their order may differ)
data = models.map(m => statisticsData[metric + '_score'][statisticsData.models.indexOf(m)]);
label = metric.charAt(0).toUpperCase() + metric.slice(1) + ' Score';
} else {
data = models.map(m => comparisonData.models[m].overall_stats[metric] || 0);
label = metric.split('_').map(w => w.charAt(0).toUpperCase() + w.slice(1)).join(' ');
}
const ctx = document.getElementById('comparisonChart');
if (window.comparisonChartInstance) {
window.comparisonChartInstance.destroy();
}
window.comparisonChartInstance = new Chart(ctx, {
type: 'radar',
data: {
labels: models,
datasets: [{
label: label,
data: data,
backgroundColor: 'rgba(118, 75, 162, 0.2)',
borderColor: 'rgba(118, 75, 162, 1)',
pointBackgroundColor: 'rgba(118, 75, 162, 1)',
pointBorderColor: '#fff',
pointHoverBackgroundColor: '#fff',
pointHoverBorderColor: 'rgba(118, 75, 162, 1)'
}]
},
options: {
responsive: true,
maintainAspectRatio: false,
scales: {
r: {
beginAtZero: true
}
}
}
});
}
async function loadStatistics() {
try {
const response = await axios.get('/api/statistics');
statisticsData = response.data;
} catch (error) {
console.error('Error loading statistics:', error);
}
}
async function loadIntelligenceMetrics() {
try {
const response = await axios.get('/api/intelligence_metrics');
intelligenceData = response.data;
let html = '';
for (const [model, metrics] of Object.entries(intelligenceData)) {
html += `
<div class="metric-card">
<h3>${model}</h3>
<div style="margin-bottom: 20px;" class="tooltip">
<strong>Overall Intelligence Score:</strong>
<span class="tooltiptext">
<strong>Calculation:</strong><br>
Overall = (IQ × 0.5) + (Adaptability × 0.3) + (Problem-Solving × 0.2)<br><br>
<strong>Values:</strong><br>
• IQ: ${metrics.iq_score.toFixed(1)}<br>
• Adaptability: ${metrics.adaptability.toFixed(1)}%<br>
• Problem-Solving: ${metrics.problem_solving_depth.toFixed(1)}<br><br>
Result: ${metrics.overall_intelligence.toFixed(1)}
</span>
<div class="progress-bar">
<div class="progress-fill" style="width: ${metrics.overall_intelligence}%">
${metrics.overall_intelligence.toFixed(1)}
</div>
</div>
</div>
<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(300px, 1fr)); gap: 15px;">
<div class="tooltip">
<strong>IQ Score:</strong>
<span class="tooltiptext">
<strong>Weighted Average of Dimensions:</strong><br><br>
${Object.entries(metrics.dimensions).map(([dim, data]) => {
const weights = {
'logical_reasoning': 1.5,
'mathematical_ability': 1.3,
'technical_knowledge': 1.4,
'instruction_following': 1.2,
'linguistic_nuance': 1.1,
'creativity': 1.0,
'conversational_depth': 1.0
};
return ` ${dim.replace(/_/g, ' ')}: ${data.score.toFixed(1)} × ${weights[dim] || 1.0}`;
}).join('<br>')}<br><br>
Normalized to 0-100 scale
</span>
<div class="progress-bar">
<div class="progress-fill" style="width: ${metrics.iq_score}%">
${metrics.iq_score.toFixed(1)}
</div>
</div>
</div>
<div class="tooltip">
<strong>Adaptability:</strong>
<span class="tooltiptext">
<strong>Cross-Category Performance:</strong><br><br>
Measures versatility across different task types.<br><br>
Formula: (Categories with avg ≥ 2.5) / (Total categories) × 100<br><br>
Higher score = more versatile model
</span>
<div class="progress-bar">
<div class="progress-fill" style="width: ${metrics.adaptability}%">
${metrics.adaptability.toFixed(1)}%
</div>
</div>
</div>
<div class="tooltip">
<strong>Problem-Solving Depth:</strong>
<span class="tooltiptext">
<strong>Performance on Challenging Tasks:</strong><br><br>
Average score on "hard" and "very_hard" difficulty tests.<br><br>
Formula: (Avg score on hard tests) × 20<br><br>
Tests critical thinking and complex reasoning
</span>
<div class="progress-bar">
<div class="progress-fill" style="width: ${metrics.problem_solving_depth}%">
${metrics.problem_solving_depth.toFixed(1)}
</div>
</div>
</div>
</div>
<h4 style="margin-top: 20px; color: #764ba2;">Cognitive Dimensions:</h4>
<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(250px, 1fr)); gap: 10px; margin-top: 10px;">
`;
const dimensionWeights = {
'logical_reasoning': 1.5,
'mathematical_ability': 1.3,
'technical_knowledge': 1.4,
'instruction_following': 1.2,
'linguistic_nuance': 1.1,
'creativity': 1.0,
'conversational_depth': 1.0
};
for (const [dim, data] of Object.entries(metrics.dimensions)) {
const weight = dimensionWeights[dim] || 1.0;
html += `
<div class="tooltip">
<small>${dim.replace(/_/g, ' ').toUpperCase()}</small>
<span class="tooltiptext">
<strong>${dim.replace(/_/g, ' ').toUpperCase()}</strong><br><br>
Score: <code>${data.score.toFixed(2)}/5.00</code><br>
Weight in IQ: <code>${weight}</code><br>
Tests evaluated: <code>${data.count}</code><br><br>
Normalized: ${data.normalized.toFixed(1)}%
</span>
<div class="progress-bar" style="height: 20px;">
<div class="progress-fill" style="width: ${data.normalized}%; font-size: 0.8em;">
${data.score.toFixed(1)}
</div>
</div>
</div>
`;
}
html += `
</div>
</div>
`;
}
document.getElementById('intelligenceMetrics').innerHTML = html;
} catch (error) {
console.error('Error loading intelligence metrics:', error);
document.getElementById('intelligenceMetrics').innerHTML =
'<p class="loading">Error loading intelligence metrics</p>';
}
}
function populateModelSelector() {
if (!comparisonData) return;
const models = Object.keys(comparisonData.models);
const select = document.getElementById('modelSelect');
select.innerHTML = '<option value="">Select a model...</option>';
models.forEach(model => {
const option = document.createElement('option');
option.value = model;
option.textContent = model;
select.appendChild(option);
});
// Populate category filter
const categoryFilter = document.getElementById('filterCategory');
categoryFilter.innerHTML = '<option value="">All Categories</option>';
comparisonData.categories.forEach(cat => {
const option = document.createElement('option');
option.value = cat;
option.textContent = cat;
categoryFilter.appendChild(option);
});
// Populate category chart selector
const categorySelect = document.getElementById('categorySelect');
categorySelect.innerHTML = '';
comparisonData.categories.forEach(cat => {
const option = document.createElement('option');
option.value = cat;
option.textContent = cat;
categorySelect.appendChild(option);
});
if (comparisonData.categories.length > 0) {
updateCategoryChart();
}
}
function updateCategoryChart() {
if (!comparisonData) return;
const category = document.getElementById('categorySelect').value;
const models = Object.keys(comparisonData.models);
const data = models.map(model => {
const stats = comparisonData.models[model].category_stats[category];
return stats ? stats.average : 0;
});
const ctx = document.getElementById('categoryChart');
if (window.categoryChartInstance) {
window.categoryChartInstance.destroy();
}
window.categoryChartInstance = new Chart(ctx, {
type: 'bar',
data: {
labels: models,
datasets: [{
label: `${category} - Average Score`,
data: data,
backgroundColor: 'rgba(102, 126, 234, 0.6)',
borderColor: 'rgba(102, 126, 234, 1)',
borderWidth: 2
}]
},
options: {
responsive: true,
maintainAspectRatio: false,
scales: {
y: {
beginAtZero: true,
max: 5
}
}
}
});
}
async function loadModelDetails() {
const modelName = document.getElementById('modelSelect').value;
if (!modelName || !comparisonData) return;
currentModelDetails = comparisonData.models[modelName].test_results;
displayDetailsTable(currentModelDetails);
}
function displayDetailsTable(results) {
let html = `
<table>
<thead>
<tr>
<th onclick="sortTable('test_name')">Test Name</th>
<th onclick="sortTable('category')">Category</th>
<th onclick="sortTable('difficulty')">Difficulty</th>
<th onclick="sortTable('score')">Score</th>
<th onclick="sortTable('generation_time')">Time (s)</th>
<th onclick="sortTable('tokens')">Tokens</th>
<th onclick="sortTable('status')">Status</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
`;
results.forEach(test => {
const scoreClass = test.score >= 4 ? 'exceptional' : test.score >= 2 ? 'pass' : 'fail';
const scoreDisplay = test.score !== null ? test.score.toFixed(1) : 'N/A';
// Extract timing and token info
const genTime = test.generation_time ? test.generation_time.toFixed(2) : 'N/A';
let tokenInfo = 'N/A';
let tokensPerSec = '';
if (test.api_metrics && test.api_metrics.usage) {
const usage = test.api_metrics.usage;
const totalTokens = usage.total_tokens || usage.eval_count || 'N/A';
const completionTokens = usage.completion_tokens || usage.eval_count;
if (totalTokens !== 'N/A') {
tokenInfo = totalTokens.toString();
// Calculate tokens/sec if we have both values
if (test.generation_time && completionTokens) {
const tps = completionTokens / test.generation_time;
tokensPerSec = `<br><small>(${tps.toFixed(1)} t/s)</small>`;
}
}
}
html += `
<tr>
<td><strong>${test.test_name}</strong></td>
<td>${test.category}</td>
<td>${test.difficulty}</td>
<td><span class="score-badge score-${scoreClass}">${scoreDisplay}</span></td>
<td>${genTime}</td>
<td>${tokenInfo}${tokensPerSec}</td>
<td>${test.status}</td>
<td><small>${test.notes}</small></td>
</tr>
`;
});
html += '</tbody></table>';
document.getElementById('detailsTable').innerHTML = html;
}
function filterTable() {
if (!currentModelDetails) return;
const searchTerm = document.getElementById('searchInput').value.toLowerCase();
const categoryFilter = document.getElementById('filterCategory').value;
const scoreFilter = document.getElementById('filterScore').value;
const filtered = currentModelDetails.filter(test => {
const matchesSearch = test.test_name.toLowerCase().includes(searchTerm) ||
test.category.toLowerCase().includes(searchTerm);
const matchesCategory = !categoryFilter || test.category === categoryFilter;
let matchesScore = true;
if (scoreFilter === 'exceptional') matchesScore = test.score >= 4;
else if (scoreFilter === 'pass') matchesScore = test.score >= 2 && test.score < 4;
else if (scoreFilter === 'fail') matchesScore = test.score < 2;
return matchesSearch && matchesCategory && matchesScore;
});
displayDetailsTable(filtered);
}
function sortTable(column) {
if (!currentModelDetails) return;
currentModelDetails.sort((a, b) => {
if (column === 'score') {
return (b[column] || 0) - (a[column] || 0);
}
return (a[column] || '').toString().localeCompare((b[column] || '').toString());
});
filterTable();
}
// Initialize on load
initDashboard();
</script>
</body>
</html>
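The tooltip formulas above (overall intelligence = 50% IQ + 30% adaptability + 20% problem-solving depth; adaptability = share of categories averaging at least 2.5; problem-solving depth = average score on hard and very_hard tests scaled to 0-100; IQ = weighted dimension average normalized to 0-100) can be mirrored in a few small functions. The real computation, including how test categories map to cognitive dimensions, lives in `analyze_results.py`, so the sketch below is an assumed back-end shape rather than the actual code.
```python
# Sketch of the intelligence metrics described by the dashboard tooltips (assumed back end).

def adaptability(category_averages):
    """Share of categories with an average score of at least 2.5, as a percentage."""
    if not category_averages:
        return 0.0
    good = sum(1 for avg in category_averages.values() if avg >= 2.5)
    return 100.0 * good / len(category_averages)


def problem_solving_depth(test_results):
    """Average score on 'hard' and 'very_hard' tests, scaled from 0-5 to 0-100."""
    hard = [t["score"] for t in test_results
            if t.get("difficulty") in ("hard", "very_hard") and t.get("score") is not None]
    return 20.0 * (sum(hard) / len(hard)) if hard else 0.0


def iq_score(dimension_scores, weights):
    """Weighted average of per-dimension scores (0-5), normalized to a 0-100 scale."""
    total_w = sum(weights.get(d, 1.0) for d in dimension_scores)
    weighted = sum(score * weights.get(d, 1.0) for d, score in dimension_scores.items())
    return 100.0 * weighted / (5.0 * total_w) if total_w else 0.0


def overall_intelligence(iq, adapt, depth):
    """Weighting shown in the dashboard tooltip: 50% IQ, 30% adaptability, 20% problem-solving."""
    return 0.5 * iq + 0.3 * adapt + 0.2 * depth
```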


@@ -1,9 +1,16 @@
# AI Model Evaluation Test Suite
# Focus: General reasoning + IT Forensics (Academic)
# AI Model Evaluation Test Suite - Enhanced Version
# Based on performance analysis of gemma3:4b-it-qat results
# Strengthened tests in categories where the model performed too well
# Added multilingual challenges
metadata:
version: "1.0"
version: "2.0"
author: "AI Evaluation Framework"
changes_from_v1:
- "Added harder variants for Creative Writing, Language Nuance, Code Generation"
- "Added Multilingual category with 4 tests"
- "Ensured minimum 3 tests per category at varying difficulties"
- "Strengthened instruction-following constraints"
focus_areas:
- Logic & Reasoning
- Mathematics & Calculation
@@ -11,10 +18,11 @@ metadata:
- Creative Writing
- Code Generation
- Language Nuance
- Problem Solving & Logistics
- IT Forensics
- Multilingual Competence
- Multi-turn Conversations
# Scoring rubric for all tests
scoring_rubric:
fail:
score: 0-1
@@ -26,10 +34,9 @@ scoring_rubric:
score: 4-5
description: "Exceeds requirements, demonstrates deep understanding"
# Individual test categories
test_categories:
# ========== GENERAL REASONING TESTS ==========
# ========== LOGIC & REASONING (3 tests) ==========
- category: "Logic & Reasoning"
tests:
@@ -49,10 +56,43 @@ test_categories:
prompt: "If it was two hours ago, it would have been as long after 1:00 PM as it was before 1:00 PM today. What time is it now? Explain your deduction step-by-step."
evaluation_criteria:
- "Shows algebraic setup: (t-2) - 13:00 = 13:00 - (t-2)"
- "Correct answer: 5:00 PM (17:00)"
- "Correct answer: 3:00 PM (15:00)"
- "Clear step-by-step reasoning"
expected_difficulty: "hard"
- id: "logic_03"
name: "Multi-Constraint Deduction"
type: "single_turn"
prompt: |
Five houses in a row are painted different colors. Their owners are from different countries, drink different beverages, smoke different brands, and keep different pets.
Facts:
1. The Brit lives in the red house.
2. The Swede keeps dogs.
3. The Dane drinks tea.
4. The green house is immediately to the left of the white house.
5. The owner of the green house drinks coffee.
6. The person who smokes Pall Mall keeps birds.
7. The owner of the yellow house smokes Dunhill.
8. The person in the center house drinks milk.
9. The Norwegian lives in the first house.
10. The person who smokes Blend lives next to the one who keeps cats.
11. The person who keeps horses lives next to the one who smokes Dunhill.
12. The person who smokes Blue Master drinks beer.
13. The German smokes Prince.
14. The Norwegian lives next to the blue house.
15. The person who smokes Blend has a neighbor who drinks water.
Who owns the fish?
evaluation_criteria:
- "Systematically works through constraints"
- "Correctly identifies the German owns the fish"
- "Shows logical deduction process"
- "Handles constraint propagation correctly"
expected_difficulty: "very_hard"
# ========== MATHEMATICS & CALCULATION (3 tests) ==========
- category: "Mathematics & Calculation"
tests:
- id: "math_01"
@@ -73,10 +113,30 @@ test_categories:
evaluation_criteria:
- "Correct unit conversions (gallons to liters, miles to km)"
- "Accurate fuel consumption calculation"
- "Remaining range calculation: approximately 570-580 km"
- "Remaining range calculation: approximately 475 km"
- "Shows intermediate steps"
expected_difficulty: "hard"
- id: "math_03"
name: "Compound Interest with Variable Rates and Withdrawals"
type: "single_turn"
prompt: |
An investment account starts with $10,000. The following occurs:
- Year 1: 5% annual interest, compounded quarterly
- Year 2: 4.5% annual interest, compounded monthly, with a $500 withdrawal at the end of Q2
- Year 3: 6% annual interest, compounded daily (assume 365 days), with a $1,000 deposit at the start of the year
Calculate the final balance at the end of Year 3. Show all intermediate calculations with at least 2 decimal places precision.
evaluation_criteria:
- "Correct Year 1 calculation with quarterly compounding"
- "Correct Year 2 with monthly compounding and mid-year withdrawal"
- "Correct Year 3 with daily compounding and initial deposit"
- "Final answer approximately $11,847-$11,850"
- "Shows all intermediate steps"
expected_difficulty: "very_hard"
# ========== INSTRUCTION FOLLOWING (4 tests) ==========
- category: "Instruction Following"
tests:
- id: "instr_01"
@@ -101,8 +161,52 @@ test_categories:
- "No forbidden words (particle, physics, Einstein)"
- "Third sentence is a question"
- "Ends with 'connected'"
expected_difficulty: "hard"
- id: "instr_03"
name: "Acrostic Technical Explanation"
type: "single_turn"
prompt: |
Write a 7-sentence explanation of how blockchain technology works.
Constraints:
1. The first letter of each sentence must spell out "SECURED" (S-E-C-U-R-E-D)
2. Sentence 3 must contain exactly 15 words
3. Sentence 5 must be a rhetorical question
4. You cannot use the words "Bitcoin", "cryptocurrency", or "mining"
5. The explanation must mention "consensus mechanism" at least once
6. Total word count must be between 80-100 words
evaluation_criteria:
- "First letters spell SECURED"
- "Sentence 3 has exactly 15 words"
- "Sentence 5 is a rhetorical question"
- "No forbidden words"
- "Contains 'consensus mechanism'"
- "Word count 80-100"
- "Technically accurate"
expected_difficulty: "very_hard"
- id: "instr_04"
name: "Structured Data Extraction with Format"
type: "single_turn"
prompt: |
Read this text and extract information in the EXACT format specified:
"Dr. Maria Santos-Ferreira, aged 47, joined TechCorp Industries on March 15, 2019 as Chief Technology Officer. She previously worked at DataSystems Inc. for 12 years. Her annual salary is $425,000 with a 15% bonus structure. She holds patents US2018/0012345 and EU2020/9876543. Contact: msantos@techcorp.com, +1-555-0147."
Output format (must match exactly, including brackets and pipes):
[NAME] | [AGE] | [COMPANY] | [ROLE] | [START_DATE:YYYY-MM-DD] | [PREV_EMPLOYER] | [PREV_YEARS] | [SALARY_USD] | [BONUS_%] | [PATENTS:semicolon-separated] | [EMAIL] | [PHONE]
evaluation_criteria:
- "Exact format match with pipes and brackets"
- "Correct date format conversion (2019-03-15)"
- "Salary as number without $ or comma"
- "Bonus as number without %"
- "Patents semicolon-separated"
- "All 12 fields present and correct"
expected_difficulty: "hard"
# ========== CREATIVE WRITING (4 tests - added harder variants) ==========
- category: "Creative Writing"
tests:
- id: "creative_01"
@@ -129,6 +233,52 @@ test_categories:
- "Atmospheric and evocative"
expected_difficulty: "hard"
- id: "creative_03"
name: "Unreliable Narrator Technical Document"
type: "single_turn"
prompt: |
Write a 3-paragraph product manual excerpt for a "Time Displacement Device" from the perspective of an unreliable narrator who is clearly lying or delusional, but the text must still function as a technically coherent manual.
Requirements:
1. Include at least 3 numbered safety warnings that are subtly absurd but grammatically serious
2. The narrator must contradict themselves at least twice
3. Include one footnote that undermines the main text
4. Do not use exclamation marks anywhere
5. Maintain formal technical writing style throughout
6. Do not explicitly state the narrator is unreliable
evaluation_criteria:
- "3 paragraphs"
- "3+ numbered safety warnings (absurd but formal)"
- "At least 2 self-contradictions"
- "Footnote that undermines text"
- "No exclamation marks"
- "Formal technical style maintained"
- "Unreliability shown not told"
expected_difficulty: "very_hard"
- id: "creative_04"
name: "Reverse Chronology Micro-Fiction"
type: "single_turn"
prompt: |
Write a complete 5-sentence story told in reverse chronological order (last event first, first event last). The story must be about a scientist making a discovery.
Additional constraints:
- Each sentence must be from a different point in time (clearly distinguishable)
- The true meaning of the story should only become clear when you reach the "first" event (last sentence)
- Include at least one piece of dialogue
- The word count must be exactly 75 words (not 74, not 76)
evaluation_criteria:
- "Exactly 5 sentences"
- "Clear reverse chronological order"
- "About a scientist's discovery"
- "Each sentence distinct time point"
- "Meaning emerges at end"
- "Contains dialogue"
- "Exactly 75 words"
expected_difficulty: "very_hard"
# ========== CODE GENERATION (4 tests) ==========
- category: "Code Generation"
tests:
- id: "code_01"
@@ -154,6 +304,55 @@ test_categories:
- "Three distinct test cases provided"
expected_difficulty: "hard"
- id: "code_03"
name: "Concurrent Rate Limiter"
type: "single_turn"
prompt: |
Write a Python class `RateLimiter` that implements a token bucket rate limiter with the following requirements:
1. Constructor takes `rate` (tokens per second) and `capacity` (max tokens)
2. Method `acquire(tokens=1)` that returns True if tokens available, False otherwise
3. Method `wait_and_acquire(tokens=1)` that blocks until tokens are available (use asyncio)
4. Must be thread-safe for the synchronous `acquire` method
5. Include a method `get_available_tokens()` that returns current token count
Provide a complete implementation with:
- Proper time-based token replenishment
- A test demonstrating both sync and async usage
- Handle edge case where requested tokens > capacity
evaluation_criteria:
- "Correct token bucket algorithm"
- "Thread-safe synchronous acquire"
- "Working async wait_and_acquire"
- "Proper time-based replenishment"
- "Edge case handling"
- "Complete test code"
expected_difficulty: "very_hard"
- id: "code_04"
name: "SQL Query Builder with Injection Prevention"
type: "single_turn"
prompt: |
Write a Python class `SafeQueryBuilder` that builds SELECT SQL queries with the following features:
1. Fluent interface: `builder.select('name', 'age').from_table('users').where('age', '>', 18).where('status', '=', 'active').order_by('name').limit(10).build()`
2. Must prevent SQL injection - all values must be parameterized
3. The `build()` method returns a tuple of (query_string, parameters_list)
4. Support for: SELECT, FROM, WHERE (multiple), ORDER BY, LIMIT, OFFSET
5. WHERE conditions can use: =, !=, >, <, >=, <=, LIKE, IN
Show the output for a query that selects users where name LIKE '%john%' AND age IN (25, 30, 35) ordered by created_at DESC with limit 5.
evaluation_criteria:
- "Fluent interface pattern correct"
- "SQL injection prevention via parameterization"
- "Returns (query, params) tuple"
- "All operations supported"
- "WHERE with IN clause works"
- "Example output is correct and safe"
expected_difficulty: "hard"
# ========== LANGUAGE NUANCE (4 tests - added harder variants) ==========
- category: "Language Nuance"
tests:
- id: "nuance_01"
@@ -181,6 +380,60 @@ test_categories:
- "Demonstrates understanding of pragmatics"
expected_difficulty: "hard"
- id: "nuance_03"
name: "Register Shifting and Code-Switching"
type: "single_turn"
prompt: |
Rewrite the following message in FOUR different registers, maintaining the same core information but adjusting tone, vocabulary, and structure appropriately:
Original: "The quarterly report shows we lost money because our main product didn't sell well and we spent too much on advertising."
Rewrite for:
1. A formal board presentation (C-suite executives)
2. A casual Slack message to your team
3. A legal disclosure document
4. An email to a non-English speaking business partner (using simple, clear language)
After the four rewrites, explain three specific linguistic changes you made for each register and why.
evaluation_criteria:
- "Board version uses formal financial terminology"
- "Slack version uses casual/colloquial language appropriately"
- "Legal version uses hedging, passive voice, precise language"
- "Simple version avoids idioms and complex structures"
- "Identifies 3 specific changes per register"
- "Explanations demonstrate metalinguistic awareness"
expected_difficulty: "very_hard"
- id: "nuance_04"
name: "Implicature and Presupposition Detection"
type: "single_turn"
prompt: |
Analyze the following dialogue for all implicatures, presuppositions, and indirect speech acts:
A: "Have you finished the Anderson report yet?"
B: "I've been dealing with the server outage all morning."
A: "Right. Well, the client is flying in tomorrow."
B: "I noticed you CC'd the whole department on that email."
A: "Just keeping everyone in the loop."
For each line, identify:
1. What is directly stated (locution)
2. What is implied but not stated (implicature)
3. What is assumed to be true (presupposition)
4. What action is being performed through speech (illocutionary force)
Then explain the underlying conflict or tension this exchange reveals.
evaluation_criteria:
- "Correctly identifies B's implicature (excuse/reason for not finishing)"
- "Identifies A's implied criticism in 'Right. Well...'"
- "Recognizes B's counter-accusation in CC comment"
- "Identifies presuppositions (report exists, server outage occurred)"
- "Correctly labels illocutionary acts (request, excuse, threat, accusation)"
- "Explains underlying workplace tension/conflict"
expected_difficulty: "very_hard"
# ========== PROBLEM SOLVING & LOGISTICS (3 tests) ==========
- category: "Problem Solving & Logistics"
tests:
- id: "logistics_01"
@@ -207,8 +460,34 @@ test_categories:
- "Reaches exactly 500 kg total"
expected_difficulty: "very_hard"
# ========== IT FORENSICS TESTS ==========
- id: "logistics_03"
name: "Resource Scheduling with Constraints"
type: "single_turn"
prompt: |
Schedule these 6 tasks across 3 workers (A, B, C) to minimize total completion time:
Task 1: 2 hours, requires Worker A or B, must complete before Task 4
Task 2: 3 hours, any worker, must complete before Task 5
Task 3: 1 hour, requires Worker C only, no dependencies
Task 4: 2 hours, requires Worker B or C, depends on Task 1
Task 5: 4 hours, requires Worker A only, depends on Task 2
Task 6: 2 hours, any worker, depends on Tasks 3 and 4
Provide:
1. A timeline showing when each task starts and ends
2. Which worker does each task
3. The total completion time
4. Explain why this is optimal (or near-optimal)
evaluation_criteria:
- "Respects all worker constraints"
- "Respects all dependencies"
- "Provides clear timeline"
- "Achieves reasonable completion time (≤9 hours possible)"
- "Explains optimization reasoning"
expected_difficulty: "hard"
# ========== IT FORENSICS - FILE SYSTEMS (3 tests) ==========
- category: "IT Forensics - File Systems"
tests:
- id: "forensics_mft_01"
@@ -281,6 +560,8 @@ test_categories:
- "Explains significance of magic numbers"
expected_difficulty: "medium"
# ========== IT FORENSICS - REGISTRY & ARTIFACTS (3 tests) ==========
- category: "IT Forensics - Registry & Artifacts"
tests:
- id: "forensics_registry_01"
@@ -323,6 +604,27 @@ test_categories:
- "Explains conversion steps"
expected_difficulty: "very_hard"
- id: "forensics_prefetch_01"
name: "Windows Prefetch Analysis"
type: "single_turn"
prompt: |
A Windows prefetch file is named: NOTEPAD.EXE-D4A5B5E5.pf
Questions:
1) What does the hash portion (D4A5B5E5) represent?
2) If you found multiple prefetch files for the same executable with different hashes, what would that indicate?
3) What forensically relevant information can typically be extracted from prefetch files?
4) In which Windows versions is prefetch enabled by default, and where are these files stored?
evaluation_criteria:
- "Hash represents file path (or explains path-based hashing)"
- "Different hashes = different paths/locations for same exe"
- "Lists: execution count, timestamps, loaded DLLs, files accessed"
- "Knows location (C:\\Windows\\Prefetch) and version availability"
- "Demonstrates practical forensic understanding"
expected_difficulty: "medium"
# ========== IT FORENSICS - MEMORY & NETWORK (3 tests) ==========
- category: "IT Forensics - Memory & Network"
tests:
- id: "forensics_memory_01"
@@ -371,6 +673,33 @@ test_categories:
- "Shows understanding of TCP header structure"
expected_difficulty: "hard"
- id: "forensics_pcap_01"
name: "PCAP Three-Way Handshake Analysis"
type: "single_turn"
prompt: |
Given these three TCP packets from a capture (simplified):
Packet 1: 10.0.0.5:49152 -> 93.184.216.34:80, Flags=SYN, Seq=1000, Ack=0
Packet 2: 93.184.216.34:80 -> 10.0.0.5:49152, Flags=SYN,ACK, Seq=5000, Ack=???
Packet 3: 10.0.0.5:49152 -> 93.184.216.34:80, Flags=ACK, Seq=???, Ack=???
Questions:
1) Fill in the missing Ack value for Packet 2
2) Fill in the missing Seq and Ack values for Packet 3
3) What is the client IP and what is the server IP?
4) What service is likely being accessed?
5) After this handshake, what sequence number will the client use for its first data byte?
evaluation_criteria:
- "Packet 2 Ack = 1001"
- "Packet 3 Seq = 1001, Ack = 5001"
- "Client: 10.0.0.5, Server: 93.184.216.34"
- "Service: HTTP (port 80)"
- "First data byte seq = 1001"
- "Demonstrates understanding of TCP handshake mechanics"
expected_difficulty: "hard"
# ========== IT FORENSICS - TIMELINE & LOG ANALYSIS (3 tests) ==========
- category: "IT Forensics - Timeline & Log Analysis"
tests:
- id: "forensics_timeline_01"
@@ -399,6 +728,147 @@ test_categories:
- "Identifies this as potential compromise scenario"
expected_difficulty: "hard"
- id: "forensics_timeline_02"
name: "Anti-Forensics Detection"
type: "single_turn"
prompt: |
Analyze these filesystem timestamps for a file 'financial_report.xlsx':
- Created (crtime): 2024-03-15 09:30:00
- Modified (mtime): 2024-03-14 16:45:00
- Accessed (atime): 2024-03-15 10:00:00
- Changed (ctime): 2024-03-15 09:30:00
And these additional artifacts:
- $MFT entry shows file created 2024-03-15
- $UsnJrnl shows rename from 'temp_8x7k2.xlsx' to 'financial_report.xlsx' at 2024-03-15 09:30:00
- $LogFile shows no entries for this file before 2024-03-15
What anomalies exist and what do they suggest about the file's history?
evaluation_criteria:
- "Identifies mtime < crtime anomaly (impossible normally)"
- "Recognizes timestamp manipulation/timestomping"
- "Notes rename from suspicious temp filename"
- "Correlates $UsnJrnl rename evidence"
- "Understands ctime cannot be easily forged"
- "Suggests file was likely copied/moved with modified timestamps"
expected_difficulty: "very_hard"
- id: "forensics_timeline_03"
name: "Windows Event Log Correlation"
type: "single_turn"
prompt: |
Correlate these Windows Event Log entries:
Security Log:
- Event 4624 (Logon): User CORP\jdoe, Type 10 (RemoteInteractive), 2024-06-01 02:15:33, Source: 192.168.1.50
- Event 4672 (Special Privileges): User CORP\jdoe, Privileges: SeDebugPrivilege, SeBackupPrivilege
- Event 4688 (Process Created): cmd.exe by CORP\jdoe, 02:16:01
- Event 4688 (Process Created): powershell.exe by CORP\jdoe, 02:16:15, CommandLine: "-ep bypass -enc SQBFAFgA..."
System Log:
- Event 7045 (Service Installed): "Windows Update Helper", 02:17:30
What type of attack pattern does this represent? What would be your next investigative steps?
evaluation_criteria:
- "Identifies RDP logon (Type 10)"
- "Recognizes privilege escalation indicators"
- "Identifies encoded PowerShell (likely malicious)"
- "Recognizes service installation for persistence"
- "Identifies late-night timing as suspicious"
- "Suggests checking service binary, decoding PowerShell, network logs"
expected_difficulty: "hard"
# ========== MULTILINGUAL COMPETENCE (4 tests - NEW CATEGORY) ==========
- category: "Multilingual Competence"
tests:
- id: "multilingual_01"
name: "Cross-Language Instruction Following"
type: "single_turn"
prompt: |
Follow these instructions, which are given in three different languages. Your response must address all three:
English: Write one sentence explaining what machine learning is.
Deutsch: Schreiben Sie einen Satz, der erklärt, warum maschinelles Lernen wichtig ist.
Español: Escriba una oración dando un ejemplo de aplicación del aprendizaje automático.
Respond to each instruction in the language it was given.
evaluation_criteria:
- "English response is in English and accurate"
- "German response is in German and grammatically correct"
- "Spanish response is in Spanish and grammatically correct"
- "All three are topically coherent (about ML)"
- "Each is exactly one sentence"
expected_difficulty: "medium"
- id: "multilingual_02"
name: "Translation with Technical Terminology Preservation"
type: "single_turn"
prompt: |
Translate the following technical paragraph into French and Japanese. Preserve technical terms that are commonly used untranslated in those languages (e.g., 'API' typically stays as 'API').
"The microservices architecture implements a RESTful API gateway that handles authentication via OAuth 2.0 tokens. The backend uses a Kubernetes cluster with horizontal pod autoscaling, while the database layer employs PostgreSQL with read replicas for improved throughput."
After translating, list which technical terms you kept in English for each language and briefly explain why.
evaluation_criteria:
- "French translation is grammatically correct"
- "Japanese translation is grammatically correct"
- "Appropriate terms preserved (API, OAuth, Kubernetes, PostgreSQL)"
- "Explains rationale for preserved terms"
- "Technical meaning preserved accurately"
expected_difficulty: "hard"
- id: "multilingual_03"
name: "Idiomatic Expression Cross-Mapping"
type: "single_turn"
prompt: |
For each of the following idiomatic expressions, provide:
1. The literal translation
2. The actual meaning
3. An equivalent idiom in English (if the original isn't English) or in another language (if the original is English)
A) German: "Da steppt der Bär"
B) Japanese: "猿も木から落ちる" (Saru mo ki kara ochiru)
C) English: "It's raining cats and dogs"
D) French: "Avoir le cafard"
E) Spanish: "Estar en las nubes"
Then identify which two idioms from different languages express the most similar concept.
evaluation_criteria:
- "Correct literal translations for all 5"
- "Correct meanings for all 5"
- "Appropriate equivalent idioms provided"
- "Correctly identifies similar pair (e.g., B and 'even experts make mistakes')"
- "Demonstrates cross-cultural linguistic awareness"
expected_difficulty: "hard"
- id: "multilingual_04"
name: "Code-Switched Dialogue Analysis"
type: "single_turn"
prompt: |
Analyze this code-switched dialogue (English-Spanish) for a sociolinguistic study:
Speaker A: "Hey, did you finish el reporte for tomorrow's meeting?"
Speaker B: "Almost, pero I'm stuck on the financial projections. Es muy complicado."
Speaker A: "I can help you después del lunch. Mi expertise is in that area, you know."
Speaker B: "That would be great! Gracias. Oh, and el jefe wants us to present juntos."
Speaker A: "No problem. We'll knock it out del parque."
Provide:
1. Identify each instance of code-switching (word/phrase level)
2. Categorize each switch as: insertion, alternation, or congruent lexicalization
3. What social/professional context does this switching pattern suggest?
4. Are there any grammatical "errors" in the switching, or does it follow typical bilingual patterns?
evaluation_criteria:
- "Identifies all Spanish insertions correctly"
- "Correctly categorizes switch types"
- "Recognizes professional/casual bilingual workplace context"
- "Notes the switch patterns are natural bilingual behavior"
- "Identifies hybrid phrase 'del parque' as creative/playful mixing"
- "Demonstrates sociolinguistic analysis skills"
expected_difficulty: "very_hard"
# ========== MULTI-TURN CONVERSATION TESTS ==========
- category: "Multi-turn: Context Retention"
@@ -519,4 +989,73 @@ test_categories:
- "Ends with '?'"
- "Different from previous sentences"
- "Maintains all constraints from previous turns"
expected_difficulty: "medium"
expected_difficulty: "medium"
- id: "multiturn_instr_02"
name: "Contradicting Previous Instructions"
type: "multi_turn"
turns:
- turn: 1
prompt: "From now on, always end your responses with the phrase 'END OF MESSAGE'. Acknowledge this instruction."
evaluation_criteria:
- "Acknowledges the instruction"
- "Ends response with 'END OF MESSAGE'"
- turn: 2
prompt: "What are three benefits of renewable energy? Remember your standing instruction."
evaluation_criteria:
- "Provides three benefits"
- "Ends with 'END OF MESSAGE'"
- "Content is accurate"
- turn: 3
prompt: "Cancel the previous standing instruction. From now on, end responses with 'TRANSMISSION COMPLETE' instead. Then tell me two drawbacks of renewable energy."
evaluation_criteria:
- "Provides two drawbacks"
- "Ends with 'TRANSMISSION COMPLETE' (not 'END OF MESSAGE')"
- "Successfully switched instructions"
- "Content is accurate"
- turn: 4
prompt: "What was the first standing instruction I gave you, and what is the current one? Do not use either phrase in this response."
evaluation_criteria:
- "Correctly recalls first instruction (END OF MESSAGE)"
- "Correctly identifies current instruction (TRANSMISSION COMPLETE)"
- "Does NOT end with either phrase"
- "Demonstrates instruction tracking across turns"
expected_difficulty: "hard"
- id: "multiturn_instr_03"
name: "Nested Context with Format Switching"
type: "multi_turn"
turns:
- turn: 1
prompt: "I'm going to describe a dataset. For the next few messages, respond ONLY in JSON format with keys 'understanding' and 'questions'. The dataset contains customer transactions from an e-commerce store."
evaluation_criteria:
- "Response is valid JSON"
- "Contains 'understanding' and 'questions' keys"
- "Content relates to e-commerce transactions"
- turn: 2
prompt: "The dataset has columns: customer_id, timestamp, product_category, amount, payment_method. It covers January 2024."
evaluation_criteria:
- "Response is valid JSON"
- "Contains 'understanding' and 'questions' keys"
- "Understanding reflects the column information"
- turn: 3
prompt: "STOP using JSON format. Now respond in plain bullet points. What analyses would you recommend for this dataset?"
evaluation_criteria:
- "Switches to bullet point format"
- "NOT in JSON format"
- "Recommendations are relevant to the dataset described"
- "References information from previous turns"
- turn: 4
prompt: "Switch back to JSON. Add a third key 'recommendations' with your top 3 analyses. Also include your understanding from turn 2."
evaluation_criteria:
- "Returns to JSON format"
- "Has three keys: understanding, questions, recommendations"
- "Recommendations from turn 3 included"
- "Understanding references turn 2 context"
expected_difficulty: "very_hard"