improvements

2026-01-16 12:48:56 +01:00
parent 514bd9b571
commit 345aa419c7
9 changed files with 3966 additions and 204 deletions

.env.example (new file, 52 lines)

@@ -0,0 +1,52 @@
# AI Model Evaluation Configuration
# Copy this file to .env and fill in your values
# =============================================================================
# MODEL UNDER TEST (MUT) - The model being evaluated
# =============================================================================
# OpenAI-compatible API endpoint for the model under test
MUT_ENDPOINT=http://localhost:11434
# API key for the model under test (optional for local endpoints like Ollama)
MUT_API_KEY=
# Model name/identifier to test
# Supports multiple models separated by commas for batch testing:
# MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Or specify a single model:
MUT_MODEL=qwen3:4b-q4_K_M
# =============================================================================
# EVALUATOR API - Used for non-interactive mode to automatically score responses
# =============================================================================
# OpenAI-compatible API endpoint for the evaluator model
EVALUATOR_ENDPOINT=http://localhost:11434
# API key for the evaluator API
EVALUATOR_API_KEY=
# Evaluator model name (should be a capable model for evaluation tasks)
EVALUATOR_MODEL=qwen3:14b
# Temperature for evaluator (lower = more consistent scoring)
EVALUATOR_TEMPERATURE=0.3
# =============================================================================
# TEST CONFIGURATION
# =============================================================================
# Path to test suite YAML file
TEST_SUITE=test_suite.yaml
# Output directory for results
OUTPUT_DIR=results
# Filter tests by category (optional, leave empty for all categories)
FILTER_CATEGORY=
# =============================================================================
# EXECUTION MODE
# =============================================================================
# Run in non-interactive mode (true/false)
# When true, uses EVALUATOR_* settings for automated scoring
# When false, prompts user for manual evaluation
NON_INTERACTIVE=false

.gitignore (vendored, 1 line changed)

@@ -174,3 +174,4 @@ cython_debug/
# PyPI configuration file
.pypirc
results/

README.md (257 lines changed)

@@ -34,6 +34,11 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks
- Category-wise performance breakdown
- Difficulty-based analysis
- CSV export for further analysis
- **🌐 Interactive Web Dashboard** (New!)
- Visual analytics with charts and graphs
- Advanced intelligence metrics
- Filtering, sorting, and statistical analysis
- Multi-dimensional performance evaluation
## Quick Start
@@ -41,25 +46,82 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks
```bash
# Python 3.8+
pip install pyyaml requests
pip install -r requirements.txt
# or manually:
pip install pyyaml requests python-dotenv flask numpy
```
### Installation
```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
# Copy the example environment file
cp .env.example .env
# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```
### Configuration with .env File (Recommended)
The test suite can be configured using a `.env` file for easier batch testing and non-interactive mode:
```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY= # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M
# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY= # Optional
EVALUATOR_MODEL=qwen3:14b # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3 # Lower = more consistent scoring
# Execution Mode
NON_INTERACTIVE=false # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY= # Optional: filter by category
```
### Basic Usage
#### 1. Test a Single Model
#### 0. Test Connectivity (Dry Run)
Before running the full test suite, verify that your API endpoints are reachable and properly configured:
```bash
# For Ollama (default: http://localhost:11434)
# Test MUT endpoint connectivity
python ai_eval.py --dry-run
# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run
# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run
# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```
The dry-run mode will:
- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test evaluator endpoint if in non-interactive mode
- Exit with success/failure status (see the sketch below)
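A connectivity probe of this kind can be reproduced in a few lines of Python. The sketch below is illustrative only: the `/v1/chat/completions` path and the request shape are assumptions based on the OpenAI-compatible API, not the actual `ai_eval.py` code.
```python
# Rough sketch of a dry-run connectivity check (assumed behavior, not ai_eval.py itself).
import os
import sys

import requests


def check_endpoint(endpoint, api_key, model):
    """Send a one-token chat completion to verify reachability, auth, and model availability."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }
    try:
        resp = requests.post(f"{endpoint}/v1/chat/completions",
                             headers=headers, json=payload, timeout=15)
        resp.raise_for_status()
        print(f"OK: {model} reachable at {endpoint}")
        return True
    except requests.RequestException as exc:
        print(f"FAILED: {model} at {endpoint}: {exc}")
        return False


if __name__ == "__main__":
    endpoint = os.getenv("MUT_ENDPOINT", "http://localhost:11434")
    api_key = os.getenv("MUT_API_KEY", "")
    models = [m.strip() for m in os.getenv("MUT_MODEL", "qwen3:4b-q4_K_M").split(",")]
    results = [check_endpoint(endpoint, api_key, m) for m in models]
    sys.exit(0 if all(results) else 1)
```
The non-zero exit code on failure makes a check like this easy to use in shell scripts or CI.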
#### 1. Interactive Mode (Manual Evaluation)
```bash
# Using .env file
python ai_eval.py
# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
# For other endpoints with API key
@@ -69,33 +131,94 @@ python ai_eval.py \
--model your-model-name
```
#### 2. Test Multiple Models (Quantization Comparison)
#### 2. Non-Interactive Mode (Automated Evaluation)
Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.
```bash
# Test different quantizations of qwen3:4b
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16
# Configure .env file
NON_INTERACTIVE=true
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_MODEL=qwen3:14b
# Test different model sizes
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M
# Run the test
python ai_eval.py
# Or with command-line arguments
python ai_eval.py \
--endpoint http://localhost:11434 \
--model qwen3:4b-q4_K_M \
--non-interactive \
--evaluator-endpoint http://localhost:11434 \
--evaluator-model qwen3:14b
```
#### 3. Filter by Category
**How Non-Interactive Mode Works:**
- For each test, the script sends the original prompt, model response, and evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation (see the sketch below)
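As an illustration of that flow only, a scoring call against an OpenAI-compatible evaluator endpoint could be shaped roughly like this; the actual system prompt, request shape, and response parsing in `ai_eval.py` will differ.
```python
# Illustrative evaluator scoring call for non-interactive mode (assumed request/response shape).
import json
import os

import requests

EVAL_SYSTEM_PROMPT = (
    "You are an impartial evaluator. Score the response from 0 to 5 against the criteria. "
    'Reply only with JSON: {"score": <0-5>, "notes": "<short justification>"}'
)


def score_response(prompt, response, criteria):
    """Ask the evaluator model for a 0-5 score plus notes for one test."""
    payload = {
        "model": os.getenv("EVALUATOR_MODEL", "qwen3:14b"),
        "temperature": float(os.getenv("EVALUATOR_TEMPERATURE", "0.3")),
        "messages": [
            {"role": "system", "content": EVAL_SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Prompt:\n{prompt}\n\nModel response:\n{response}\n\n"
                "Evaluation criteria:\n- " + "\n- ".join(criteria)
            )},
        ],
    }
    endpoint = os.getenv("EVALUATOR_ENDPOINT", "http://localhost:11434")
    r = requests.post(f"{endpoint}/v1/chat/completions", json=payload, timeout=120)
    r.raise_for_status()
    # Assumes the evaluator returns clean JSON; production code needs more defensive parsing.
    return json.loads(r.json()["choices"][0]["message"]["content"])
```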
**Choosing an Evaluator Model:**
- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring
#### 3. Test Multiple Models (Batch Mode)
Test multiple models in one run by specifying comma-separated model names:
```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Run batch test
python ai_eval.py
# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```
The script will automatically test each model sequentially and save individual results.
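Conceptually, batch mode is a loop over the comma-separated `MUT_MODEL` list. The sketch below shows that idea; the `run_single_model` callable is a placeholder, and the result-file naming mirrors the `results/` filenames shown further down in the project structure, so both are assumptions rather than the real implementation.
```python
# Sketch of the batch loop: one full evaluation pass per comma-separated model name.
import os


def run_batch(run_single_model):
    """run_single_model is a placeholder for whatever runs one test-suite pass for one model."""
    models = [m.strip() for m in os.getenv("MUT_MODEL", "").split(",") if m.strip()]
    output_dir = os.getenv("OUTPUT_DIR", "results")
    os.makedirs(output_dir, exist_ok=True)
    for model in models:
        safe_name = model.replace(":", "_").replace("/", "_")  # qwen3:4b-q4_K_M -> qwen3_4b-q4_K_M
        result_path = os.path.join(output_dir, f"{safe_name}_latest.json")
        print(f"Testing {model} -> {result_path}")
        run_single_model(model, result_path)
```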
#### 4. Filter by Category
```bash
# Test only IT Forensics categories
python ai_eval.py \
--endpoint http://localhost:11434 \
--model qwen3:4b \
--category "IT Forensics - File Systems"
python ai_eval.py --category "IT Forensics - File Systems"
```
#### 4. Analyze Results
#### 5. Analyze Results
```bash
# Compare all tested models
## Analyzing Results
### Interactive Web Dashboard (Recommended)
Launch the comprehensive web interface for visual analysis:
```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web
# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```
**Features:**
- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation
See [WEB_INTERFACE.md](WEB_INTERFACE.md) for detailed documentation.
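Under the hood the dashboard is served by a small Flask app (Flask comes from `requirements.txt`), and the template fetches JSON from `/api/comparison`, `/api/statistics`, and `/api/intelligence_metrics`. The real routes and aggregation live in `analyze_results.py`; the following is only a structural sketch with a stubbed comparison payload.
```python
# Structural sketch of the --web backend (stubbed; the real aggregation is in analyze_results.py).
import glob
import json
import os

from flask import Flask, jsonify, render_template

app = Flask(__name__)


def load_results(results_dir="results"):
    """Load every result JSON in the output directory; the schema is defined by ai_eval.py."""
    data = {}
    for path in glob.glob(os.path.join(results_dir, "*.json")):
        with open(path, encoding="utf-8") as fh:
            data[os.path.basename(path)] = json.load(fh)
    return data


@app.route("/")
def dashboard():
    return render_template("dashboard.html")


@app.route("/api/comparison")
def comparison():
    # dashboard.html expects {"models": {...}, "categories": [...]}; this stub just echoes
    # the raw files instead of the real per-model aggregation.
    return jsonify({"models": load_results(), "categories": []})


if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8080)
```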
### Command-Line Analysis
```bash
# Compare all models
python analyze_results.py --compare
# Detailed report for specific model
@@ -188,6 +311,7 @@ All tests are evaluated on a 0-5 scale:
├── ai_eval.py # Main testing script
├── analyze_results.py # Results analysis and comparison
├── test_suite.yaml # Test definitions
├── .env.example # Configuration template
├── results/ # Auto-created results directory
│ ├── qwen3_4b-q4_K_M_latest.json
│ ├── qwen3_4b-q8_0_latest.json
@@ -195,6 +319,60 @@ All tests are evaluated on a 0-5 scale:
└── README.md
```
## Configuration Reference
### Environment Variables (.env file)
All configuration can be set via a `.env` file or command-line arguments; command-line arguments override `.env` values.
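The usual pattern behind that precedence looks roughly like the sketch below (illustrative only; the actual argument handling in `ai_eval.py` may differ).
```python
# Typical precedence pattern: command-line flag > .env value > built-in default.
import argparse
import os

from dotenv import load_dotenv

load_dotenv()  # loads .env into the environment without overwriting variables already set

parser = argparse.ArgumentParser()
parser.add_argument("--endpoint", default=os.getenv("MUT_ENDPOINT", "http://localhost:11434"))
parser.add_argument("--model", default=os.getenv("MUT_MODEL", "qwen3:4b-q4_K_M"))
parser.add_argument("--non-interactive", action="store_true",
                    default=os.getenv("NON_INTERACTIVE", "false").lower() == "true")
args = parser.parse_args()
print(args.endpoint, args.model, args.non_interactive)
```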
#### Model Under Test (MUT)
| Variable | Description | Example |
| --- | --- | --- |
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |
#### Evaluator Configuration (for Non-Interactive Mode)
| Variable | Description | Example |
| --- | --- | --- |
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |
#### Test Configuration
| Variable | Description | Example |
| --- | --- | --- |
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |
### Command-Line Arguments
All environment variables have corresponding command-line flags:
```bash
python ai_eval.py --help
Options:
--endpoint ENDPOINT Model under test endpoint
--api-key API_KEY Model under test API key
--model MODEL Model name to test
--test-suite FILE Test suite YAML file
--output-dir DIR Output directory
--category CATEGORY Filter by category
--non-interactive Enable automated evaluation
--evaluator-endpoint ENDPOINT Evaluator API endpoint
--evaluator-api-key KEY Evaluator API key
--evaluator-model MODEL Evaluator model name
--evaluator-temperature TEMP Evaluator temperature
```
## Advanced Usage
### Custom Test Suite
@@ -214,28 +392,25 @@ Edit `test_suite.yaml` to add your own tests:
expected_difficulty: "medium" # medium, hard, very_hard
```
### Batch Testing Script
### Batch Testing Examples
Create `batch_test.sh`:
Testing multiple models using the `.env` configuration:
```bash
#!/bin/bash
# Configure .env with multiple models
cp .env.example .env
nano .env
ENDPOINT="http://localhost:11434"
# Set multiple models (comma-separated)
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Test all qwen3:4b quantizations
for quant in q4_K_M q8_0 fp16; do
echo "Testing qwen3:4b-${quant}..."
python ai_eval.py --endpoint $ENDPOINT --model "qwen3:4b-${quant}"
done
# Run batch tests
python ai_eval.py
# Test all sizes with q4_K_M
for size in 4b 8b 14b; do
echo "Testing qwen3:${size}-q4_K_M..."
python ai_eval.py --endpoint $ENDPOINT --model "qwen3:${size}-q4_K_M"
done
# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M
# Generate comparison
# Generate comparison after testing
python analyze_results.py --compare
```
@@ -244,8 +419,8 @@ python analyze_results.py --compare
For OpenAI-compatible cloud services:
```bash
python ai_eval.py \
--endpoint https://api.service.com \
--api-key your-api-key \
--model model-name
# In .env file
MUT_ENDPOINT=https://api.service.com
MUT_API_KEY=your-api-key
MUT_MODEL=model-name
```

File diff suppressed because it is too large.

File diff suppressed because it is too large.


@@ -1,85 +0,0 @@
#!/bin/bash
# Batch Test Script for AI Model Evaluation
# Tests multiple models and generates comparison report
# Configuration
ENDPOINT="${ENDPOINT:-http://localhost:11434}"
API_KEY="${API_KEY:-}"
# Color output
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}AI Model Batch Testing${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
echo "Endpoint: $ENDPOINT"
echo "API Key: ${API_KEY:0:10}${API_KEY:+...}"
echo ""
# Function to run test
run_test() {
local model=$1
echo -e "${GREEN}Testing: $model${NC}"
if [ -z "$API_KEY" ]; then
python ai_eval.py --endpoint "$ENDPOINT" --model "$model"
else
python ai_eval.py --endpoint "$ENDPOINT" --api-key "$API_KEY" --model "$model"
fi
if [ $? -eq 0 ]; then
echo -e "${GREEN}✓ Completed: $model${NC}"
else
echo -e "${YELLOW}⚠ Failed or interrupted: $model${NC}"
fi
echo ""
}
# Test qwen3:4b models with different quantizations
echo -e "${BLUE}=== Testing qwen3:4b with different quantizations ===${NC}"
echo ""
models_4b=(
"qwen3:4b-q4_K_M"
"qwen3:4b-q8_0"
"qwen3:4b-fp16"
)
for model in "${models_4b[@]}"; do
run_test "$model"
done
# Test different model sizes with q4_K_M quantization
echo -e "${BLUE}=== Testing different model sizes (q4_K_M) ===${NC}"
echo ""
models_sizes=(
"qwen3:4b-q4_K_M"
"qwen3:8b-q4_K_M"
"qwen3:14b-q4_K_M"
)
for model in "${models_sizes[@]}"; do
run_test "$model"
done
# Generate comparison report
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}Generating Comparison Report${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
python analyze_results.py --compare
python analyze_results.py --export batch_comparison.csv
echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN}Batch Testing Complete!${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo "Results saved in ./results/"
echo "Comparison CSV: ./results/batch_comparison.csv"


@@ -1,2 +1,5 @@
pyyaml
requests
requests
python-dotenv
flask
numpy

templates/dashboard.html (new file, 977 lines)

@@ -0,0 +1,977 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>LLM Evaluation Dashboard</title>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
:root {
--bg-gradient-start: #667eea;
--bg-gradient-end: #764ba2;
--card-bg: #ffffff;
--text-primary: #333333;
--text-secondary: #666666;
--border-color: #e0e0e0;
--stat-card-bg: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
--shadow: rgba(0,0,0,0.1);
--shadow-hover: rgba(0,0,0,0.15);
}
body.dark-mode {
--bg-gradient-start: #1a1a2e;
--bg-gradient-end: #16213e;
--card-bg: #0f1419;
--text-primary: #e0e0e0;
--text-secondary: #a0a0a0;
--border-color: #2a2a3e;
--stat-card-bg: linear-gradient(135deg, #1a1a2e 0%, #16213e 100%);
--shadow: rgba(0,0,0,0.3);
--shadow-hover: rgba(0,0,0,0.5);
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background: linear-gradient(135deg, var(--bg-gradient-start) 0%, var(--bg-gradient-end) 100%);
color: var(--text-primary);
min-height: 100vh;
padding: 20px;
transition: all 0.3s ease;
}
.container {
max-width: 1400px;
margin: 0 auto;
}
header {
background: var(--card-bg);
padding: 30px;
border-radius: 15px;
box-shadow: 0 10px 40px var(--shadow);
margin-bottom: 30px;
position: relative;
}
.theme-toggle {
position: absolute;
top: 30px;
right: 30px;
background: var(--border-color);
border: none;
padding: 10px 20px;
border-radius: 20px;
cursor: pointer;
font-size: 1em;
transition: all 0.3s;
}
.theme-toggle:hover {
transform: scale(1.05);
box-shadow: 0 4px 15px var(--shadow-hover);
}
h1 {
font-size: 2.5em;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-webkit-background-clip: text;
-webkit-text-fill-color: transparent;
margin-bottom: 10px;
}
.subtitle {
color: var(--text-secondary);
font-size: 1.1em;
}
.tabs {
display: flex;
gap: 10px;
margin-bottom: 20px;
flex-wrap: wrap;
}
.tab {
background: var(--card-bg);
border: none;
padding: 12px 24px;
border-radius: 8px;
cursor: pointer;
font-size: 1em;
transition: all 0.3s;
box-shadow: 0 2px 10px var(--shadow);
color: var(--text-primary);
}
.tab:hover {
transform: translateY(-2px);
box-shadow: 0 4px 15px var(--shadow-hover);
}
.tab.active {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}
.content-panel {
display: none;
background: var(--card-bg);
padding: 30px;
border-radius: 15px;
box-shadow: 0 10px 40px var(--shadow);
animation: fadeIn 0.3s;
}
.content-panel.active {
display: block;
}
@keyframes fadeIn {
from { opacity: 0; transform: translateY(10px); }
to { opacity: 1; transform: translateY(0); }
}
.stats-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
gap: 20px;
margin-bottom: 30px;
}
.stat-card {
background: var(--stat-card-bg);
padding: 20px;
border-radius: 10px;
text-align: center;
}
.stat-card h3 {
font-size: 0.9em;
color: var(--text-secondary);
margin-bottom: 10px;
text-transform: uppercase;
}
.stat-card .value {
font-size: 2.5em;
font-weight: bold;
color: #667eea;
}
.chart-container {
position: relative;
height: 400px;
margin-bottom: 30px;
}
.controls {
display: flex;
gap: 15px;
margin-bottom: 20px;
flex-wrap: wrap;
}
select, input {
padding: 10px 15px;
border: 2px solid var(--border-color);
border-radius: 8px;
font-size: 1em;
background: var(--card-bg);
color: var(--text-primary);
cursor: pointer;
transition: border-color 0.3s;
}
select:hover, input:hover {
border-color: #667eea;
}
select:focus, input:focus {
outline: none;
border-color: #764ba2;
}
table {
width: 100%;
border-collapse: collapse;
margin-top: 20px;
}
th, td {
padding: 12px;
text-align: left;
border-bottom: 1px solid var(--border-color);
}
th {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
font-weight: 600;
cursor: pointer;
user-select: none;
}
th:hover {
opacity: 0.9;
}
tr:hover {
background: var(--border-color);
}
.score-badge {
display: inline-block;
padding: 5px 12px;
border-radius: 20px;
font-weight: bold;
font-size: 0.9em;
}
.score-exceptional {
background: #10b981;
color: white;
}
.score-pass {
background: #f59e0b;
color: white;
}
.score-fail {
background: #ef4444;
color: white;
}
.loading {
text-align: center;
padding: 40px;
color: var(--text-secondary);
}
.spinner {
border: 3px solid var(--border-color);
border-top: 3px solid #667eea;
border-radius: 50%;
width: 40px;
height: 40px;
animation: spin 1s linear infinite;
margin: 20px auto;
}
@keyframes spin {
0% { transform: rotate(0deg); }
100% { transform: rotate(360deg); }
}
.model-selector {
display: flex;
gap: 10px;
flex-wrap: wrap;
margin-bottom: 20px;
}
.model-chip {
padding: 8px 16px;
border-radius: 20px;
border: 2px solid #667eea;
background: var(--card-bg);
color: var(--text-primary);
cursor: pointer;
transition: all 0.3s;
}
.model-chip:hover {
background: #667eea;
color: white;
}
.model-chip.selected {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}
.metric-card {
background: var(--card-bg);
border: 2px solid var(--border-color);
border-radius: 10px;
padding: 20px;
margin-bottom: 20px;
}
.metric-card h3 {
color: #667eea;
margin-bottom: 15px;
}
.progress-bar {
background: var(--border-color);
height: 30px;
border-radius: 15px;
overflow: hidden;
margin: 10px 0;
position: relative;
cursor: help;
}
.progress-fill {
height: 100%;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
transition: width 0.5s;
display: flex;
align-items: center;
justify-content: flex-end;
padding-right: 10px;
color: white;
font-weight: bold;
}
/* Tooltip styles */
.tooltip {
position: relative;
display: inline-block;
}
.tooltip .tooltiptext {
visibility: hidden;
width: 300px;
background-color: rgba(0, 0, 0, 0.9);
color: #fff;
text-align: left;
border-radius: 8px;
padding: 12px;
position: absolute;
z-index: 1000;
bottom: 125%;
left: 50%;
margin-left: -150px;
opacity: 0;
transition: opacity 0.3s;
font-size: 0.85em;
line-height: 1.4;
box-shadow: 0 4px 20px rgba(0,0,0,0.3);
}
.tooltip .tooltiptext::after {
content: "";
position: absolute;
top: 100%;
left: 50%;
margin-left: -5px;
border-width: 5px;
border-style: solid;
border-color: rgba(0, 0, 0, 0.9) transparent transparent transparent;
}
.tooltip:hover .tooltiptext {
visibility: visible;
opacity: 1;
}
.tooltiptext code {
background: rgba(255, 255, 255, 0.1);
padding: 2px 6px;
border-radius: 3px;
font-family: monospace;
font-size: 0.9em;
}
.tooltiptext strong {
color: #667eea;
}
</style>
</head>
<body>
<div class="container">
<header>
<button class="theme-toggle" onclick="toggleTheme()">🌓 Toggle Dark Mode</button>
<h1>🧠 LLM Evaluation Dashboard</h1>
<p class="subtitle">Comprehensive Intelligence & Performance Analysis</p>
</header>
<div class="tabs">
<button class="tab active" onclick="switchTab('overview')">📊 Overview</button>
<button class="tab" onclick="switchTab('comparison')">⚔️ Model Comparison</button>
<button class="tab" onclick="switchTab('intelligence')">🎯 Intelligence Metrics</button>
<button class="tab" onclick="switchTab('categories')">📂 Category Analysis</button>
<button class="tab" onclick="switchTab('details')">🔍 Detailed Results</button>
</div>
<div id="overview" class="content-panel active">
<h2>System Overview</h2>
<div class="stats-grid" id="overviewStats">
<div class="loading">
<div class="spinner"></div>
Loading data...
</div>
</div>
<div class="chart-container">
<canvas id="overviewChart"></canvas>
</div>
</div>
<div id="comparison" class="content-panel">
<h2>Model Performance Comparison</h2>
<div class="controls">
<select id="metricSelect" onchange="updateComparisonChart()">
<option value="average">Average Score</option>
<option value="pass_rate">Pass Rate</option>
<option value="exceptional_rate">Exceptional Rate</option>
<option value="consistency">Consistency</option>
<option value="robustness">Robustness</option>
</select>
</div>
<div class="chart-container">
<canvas id="comparisonChart"></canvas>
</div>
</div>
<div id="intelligence" class="content-panel">
<h2>Intelligence Metrics Analysis</h2>
<p style="margin-bottom: 20px; color: #666;">
Advanced metrics evaluating different dimensions of AI intelligence and reasoning capabilities.
</p>
<div id="intelligenceMetrics">
<div class="loading">
<div class="spinner"></div>
Calculating intelligence metrics...
</div>
</div>
</div>
<div id="categories" class="content-panel">
<h2>Performance by Category</h2>
<div class="controls">
<select id="categorySelect" onchange="updateCategoryChart()">
<option value="">Loading categories...</option>
</select>
</div>
<div class="chart-container">
<canvas id="categoryChart"></canvas>
</div>
</div>
<div id="details" class="content-panel">
<h2>Detailed Test Results</h2>
<div class="controls">
<select id="modelSelect" onchange="loadModelDetails()">
<option value="">Select a model...</option>
</select>
<input type="text" id="searchInput" placeholder="Search tests..." onkeyup="filterTable()">
<select id="filterCategory" onchange="filterTable()">
<option value="">All Categories</option>
</select>
<select id="filterScore" onchange="filterTable()">
<option value="">All Scores</option>
<option value="exceptional">Exceptional (4-5)</option>
<option value="pass">Pass (2-3)</option>
<option value="fail">Fail (0-1)</option>
</select>
</div>
<div id="detailsTable">
<p class="loading">Select a model to view detailed results</p>
</div>
</div>
</div>
<script>
let comparisonData = null;
let statisticsData = null;
let intelligenceData = null;
let currentModelDetails = null;
// Theme toggle functionality
function toggleTheme() {
document.body.classList.toggle('dark-mode');
const isDark = document.body.classList.contains('dark-mode');
localStorage.setItem('darkMode', isDark ? 'enabled' : 'disabled');
}
// Load theme preference
function loadThemePreference() {
const darkMode = localStorage.getItem('darkMode');
if (darkMode === 'enabled') {
document.body.classList.add('dark-mode');
}
}
// Tab switching
function switchTab(tabName) {
document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
document.querySelectorAll('.content-panel').forEach(p => p.classList.remove('active'));
event.target.classList.add('active');
document.getElementById(tabName).classList.add('active');
}
// Initialize dashboard
async function initDashboard() {
loadThemePreference();
await loadOverview();
await loadComparison();
await loadStatistics();
await loadIntelligenceMetrics();
populateModelSelector();
}
async function loadOverview() {
try {
const response = await axios.get('/api/comparison');
comparisonData = response.data;
const models = Object.keys(comparisonData.models);
const totalTests = models.reduce((sum, model) =>
sum + comparisonData.models[model].metadata.total_tests, 0);
const avgScore = models.reduce((sum, model) =>
sum + (comparisonData.models[model].overall_stats.average || 0), 0) / models.length;
const statsHtml = `
<div class="stat-card">
<h3>Models Evaluated</h3>
<div class="value">${models.length}</div>
</div>
<div class="stat-card">
<h3>Total Tests</h3>
<div class="value">${totalTests}</div>
</div>
<div class="stat-card">
<h3>Average Score</h3>
<div class="value">${avgScore.toFixed(2)}</div>
</div>
<div class="stat-card">
<h3>Categories</h3>
<div class="value">${comparisonData.categories.length}</div>
</div>
`;
document.getElementById('overviewStats').innerHTML = statsHtml;
// Create overview chart
const ctx = document.getElementById('overviewChart').getContext('2d');
new Chart(ctx, {
type: 'bar',
data: {
labels: models,
datasets: [{
label: 'Average Score',
data: models.map(m => comparisonData.models[m].overall_stats.average || 0),
backgroundColor: 'rgba(102, 126, 234, 0.6)',
borderColor: 'rgba(102, 126, 234, 1)',
borderWidth: 2
}]
},
options: {
responsive: true,
maintainAspectRatio: false,
scales: {
y: {
beginAtZero: true,
max: 5
}
}
}
});
} catch (error) {
console.error('Error loading overview:', error);
}
}
async function loadComparison() {
updateComparisonChart();
}
async function updateComparisonChart() {
if (!comparisonData) return;
const metric = document.getElementById('metricSelect').value;
const models = Object.keys(comparisonData.models);
let data, label;
if (metric === 'consistency' || metric === 'robustness') {
if (!statisticsData) {
await loadStatistics();
}
// Align each model with its position in the statistics arrays (their order may differ)
data = models.map(m => statisticsData[metric + '_score'][statisticsData.models.indexOf(m)]);
label = metric.charAt(0).toUpperCase() + metric.slice(1) + ' Score';
} else {
data = models.map(m => comparisonData.models[m].overall_stats[metric] || 0);
label = metric.split('_').map(w => w.charAt(0).toUpperCase() + w.slice(1)).join(' ');
}
const ctx = document.getElementById('comparisonChart');
if (window.comparisonChartInstance) {
window.comparisonChartInstance.destroy();
}
window.comparisonChartInstance = new Chart(ctx, {
type: 'radar',
data: {
labels: models,
datasets: [{
label: label,
data: data,
backgroundColor: 'rgba(118, 75, 162, 0.2)',
borderColor: 'rgba(118, 75, 162, 1)',
pointBackgroundColor: 'rgba(118, 75, 162, 1)',
pointBorderColor: '#fff',
pointHoverBackgroundColor: '#fff',
pointHoverBorderColor: 'rgba(118, 75, 162, 1)'
}]
},
options: {
responsive: true,
maintainAspectRatio: false,
scales: {
r: {
beginAtZero: true
}
}
}
});
}
async function loadStatistics() {
try {
const response = await axios.get('/api/statistics');
statisticsData = response.data;
} catch (error) {
console.error('Error loading statistics:', error);
}
}
async function loadIntelligenceMetrics() {
try {
const response = await axios.get('/api/intelligence_metrics');
intelligenceData = response.data;
let html = '';
for (const [model, metrics] of Object.entries(intelligenceData)) {
html += `
<div class="metric-card">
<h3>${model}</h3>
<div style="margin-bottom: 20px;" class="tooltip">
<strong>Overall Intelligence Score:</strong>
<span class="tooltiptext">
<strong>Calculation:</strong><br>
Overall = (IQ × 0.5) + (Adaptability × 0.3) + (Problem-Solving × 0.2)<br><br>
<strong>Values:</strong><br>
• IQ: ${metrics.iq_score.toFixed(1)}<br>
• Adaptability: ${metrics.adaptability.toFixed(1)}%<br>
• Problem-Solving: ${metrics.problem_solving_depth.toFixed(1)}<br><br>
Result: ${metrics.overall_intelligence.toFixed(1)}
</span>
<div class="progress-bar">
<div class="progress-fill" style="width: ${metrics.overall_intelligence}%">
${metrics.overall_intelligence.toFixed(1)}
</div>
</div>
</div>
<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(300px, 1fr)); gap: 15px;">
<div class="tooltip">
<strong>IQ Score:</strong>
<span class="tooltiptext">
<strong>Weighted Average of Dimensions:</strong><br><br>
${Object.entries(metrics.dimensions).map(([dim, data]) => {
const weights = {
'logical_reasoning': 1.5,
'mathematical_ability': 1.3,
'technical_knowledge': 1.4,
'instruction_following': 1.2,
'linguistic_nuance': 1.1,
'creativity': 1.0,
'conversational_depth': 1.0
};
return ` ${dim.replace(/_/g, ' ')}: ${data.score.toFixed(1)} × ${weights[dim] || 1.0}`;
}).join('<br>')}<br><br>
Normalized to 0-100 scale
</span>
<div class="progress-bar">
<div class="progress-fill" style="width: ${metrics.iq_score}%">
${metrics.iq_score.toFixed(1)}
</div>
</div>
</div>
<div class="tooltip">
<strong>Adaptability:</strong>
<span class="tooltiptext">
<strong>Cross-Category Performance:</strong><br><br>
Measures versatility across different task types.<br><br>
Formula: (Categories with avg ≥ 2.5) / (Total categories) × 100<br><br>
Higher score = more versatile model
</span>
<div class="progress-bar">
<div class="progress-fill" style="width: ${metrics.adaptability}%">
${metrics.adaptability.toFixed(1)}%
</div>
</div>
</div>
<div class="tooltip">
<strong>Problem-Solving Depth:</strong>
<span class="tooltiptext">
<strong>Performance on Challenging Tasks:</strong><br><br>
Average score on "hard" and "very_hard" difficulty tests.<br><br>
Formula: (Avg score on hard tests) × 20<br><br>
Tests critical thinking and complex reasoning
</span>
<div class="progress-bar">
<div class="progress-fill" style="width: ${metrics.problem_solving_depth}%">
${metrics.problem_solving_depth.toFixed(1)}
</div>
</div>
</div>
</div>
<h4 style="margin-top: 20px; color: #764ba2;">Cognitive Dimensions:</h4>
<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(250px, 1fr)); gap: 10px; margin-top: 10px;">
`;
const dimensionWeights = {
'logical_reasoning': 1.5,
'mathematical_ability': 1.3,
'technical_knowledge': 1.4,
'instruction_following': 1.2,
'linguistic_nuance': 1.1,
'creativity': 1.0,
'conversational_depth': 1.0
};
for (const [dim, data] of Object.entries(metrics.dimensions)) {
const weight = dimensionWeights[dim] || 1.0;
html += `
<div class="tooltip">
<small>${dim.replace(/_/g, ' ').toUpperCase()}</small>
<span class="tooltiptext">
<strong>${dim.replace(/_/g, ' ').toUpperCase()}</strong><br><br>
Score: <code>${data.score.toFixed(2)}/5.00</code><br>
Weight in IQ: <code>${weight}</code><br>
Tests evaluated: <code>${data.count}</code><br><br>
Normalized: ${data.normalized.toFixed(1)}%
</span>
<div class="progress-bar" style="height: 20px;">
<div class="progress-fill" style="width: ${data.normalized}%; font-size: 0.8em;">
${data.score.toFixed(1)}
</div>
</div>
</div>
`;
}
html += `
</div>
</div>
`;
}
document.getElementById('intelligenceMetrics').innerHTML = html;
} catch (error) {
console.error('Error loading intelligence metrics:', error);
document.getElementById('intelligenceMetrics').innerHTML =
'<p class="loading">Error loading intelligence metrics</p>';
}
}
function populateModelSelector() {
if (!comparisonData) return;
const models = Object.keys(comparisonData.models);
const select = document.getElementById('modelSelect');
select.innerHTML = '<option value="">Select a model...</option>';
models.forEach(model => {
const option = document.createElement('option');
option.value = model;
option.textContent = model;
select.appendChild(option);
});
// Populate category filter
const categoryFilter = document.getElementById('filterCategory');
categoryFilter.innerHTML = '<option value="">All Categories</option>';
comparisonData.categories.forEach(cat => {
const option = document.createElement('option');
option.value = cat;
option.textContent = cat;
categoryFilter.appendChild(option);
});
// Populate category chart selector
const categorySelect = document.getElementById('categorySelect');
categorySelect.innerHTML = '';
comparisonData.categories.forEach(cat => {
const option = document.createElement('option');
option.value = cat;
option.textContent = cat;
categorySelect.appendChild(option);
});
if (comparisonData.categories.length > 0) {
updateCategoryChart();
}
}
function updateCategoryChart() {
if (!comparisonData) return;
const category = document.getElementById('categorySelect').value;
const models = Object.keys(comparisonData.models);
const data = models.map(model => {
const stats = comparisonData.models[model].category_stats[category];
return stats ? stats.average : 0;
});
const ctx = document.getElementById('categoryChart');
if (window.categoryChartInstance) {
window.categoryChartInstance.destroy();
}
window.categoryChartInstance = new Chart(ctx, {
type: 'bar',
data: {
labels: models,
datasets: [{
label: `${category} - Average Score`,
data: data,
backgroundColor: 'rgba(102, 126, 234, 0.6)',
borderColor: 'rgba(102, 126, 234, 1)',
borderWidth: 2
}]
},
options: {
responsive: true,
maintainAspectRatio: false,
scales: {
y: {
beginAtZero: true,
max: 5
}
}
}
});
}
async function loadModelDetails() {
const modelName = document.getElementById('modelSelect').value;
if (!modelName || !comparisonData) return;
currentModelDetails = comparisonData.models[modelName].test_results;
displayDetailsTable(currentModelDetails);
}
function displayDetailsTable(results) {
let html = `
<table>
<thead>
<tr>
<th onclick="sortTable('test_name')">Test Name</th>
<th onclick="sortTable('category')">Category</th>
<th onclick="sortTable('difficulty')">Difficulty</th>
<th onclick="sortTable('score')">Score</th>
<th onclick="sortTable('generation_time')">Time (s)</th>
<th onclick="sortTable('tokens')">Tokens</th>
<th onclick="sortTable('status')">Status</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
`;
results.forEach(test => {
const scoreClass = test.score >= 4 ? 'exceptional' : test.score >= 2 ? 'pass' : 'fail';
const scoreDisplay = test.score !== null ? test.score.toFixed(1) : 'N/A';
// Extract timing and token info
const genTime = test.generation_time ? test.generation_time.toFixed(2) : 'N/A';
let tokenInfo = 'N/A';
let tokensPerSec = '';
if (test.api_metrics && test.api_metrics.usage) {
const usage = test.api_metrics.usage;
const totalTokens = usage.total_tokens || usage.eval_count || 'N/A';
const completionTokens = usage.completion_tokens || usage.eval_count;
if (totalTokens !== 'N/A') {
tokenInfo = totalTokens.toString();
// Calculate tokens/sec if we have both values
if (test.generation_time && completionTokens) {
const tps = completionTokens / test.generation_time;
tokensPerSec = `<br><small>(${tps.toFixed(1)} t/s)</small>`;
}
}
}
html += `
<tr>
<td><strong>${test.test_name}</strong></td>
<td>${test.category}</td>
<td>${test.difficulty}</td>
<td><span class="score-badge score-${scoreClass}">${scoreDisplay}</span></td>
<td>${genTime}</td>
<td>${tokenInfo}${tokensPerSec}</td>
<td>${test.status}</td>
<td><small>${test.notes}</small></td>
</tr>
`;
});
html += '</tbody></table>';
document.getElementById('detailsTable').innerHTML = html;
}
function filterTable() {
if (!currentModelDetails) return;
const searchTerm = document.getElementById('searchInput').value.toLowerCase();
const categoryFilter = document.getElementById('filterCategory').value;
const scoreFilter = document.getElementById('filterScore').value;
const filtered = currentModelDetails.filter(test => {
const matchesSearch = test.test_name.toLowerCase().includes(searchTerm) ||
test.category.toLowerCase().includes(searchTerm);
const matchesCategory = !categoryFilter || test.category === categoryFilter;
let matchesScore = true;
if (scoreFilter === 'exceptional') matchesScore = test.score >= 4;
else if (scoreFilter === 'pass') matchesScore = test.score >= 2 && test.score < 4;
else if (scoreFilter === 'fail') matchesScore = test.score < 2;
return matchesSearch && matchesCategory && matchesScore;
});
displayDetailsTable(filtered);
}
function sortTable(column) {
if (!currentModelDetails) return;
currentModelDetails.sort((a, b) => {
if (column === 'score') {
return (b[column] || 0) - (a[column] || 0);
}
return (a[column] || '').toString().localeCompare((b[column] || '').toString());
});
filterTable();
}
// Initialize on load
initDashboard();
</script>
</body>
</html>
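The tooltip formulas above (overall intelligence = 50% IQ + 30% adaptability + 20% problem-solving depth; adaptability = share of categories averaging at least 2.5; problem-solving depth = average score on hard and very_hard tests scaled to 0-100; IQ = weighted dimension average normalized to 0-100) can be mirrored in a few small functions. The real computation, including how test categories map to cognitive dimensions, lives in `analyze_results.py`, so the sketch below is an assumed back-end shape rather than the actual code.
```python
# Sketch of the intelligence metrics described by the dashboard tooltips (assumed back end).

def adaptability(category_averages):
    """Share of categories with an average score of at least 2.5, as a percentage."""
    if not category_averages:
        return 0.0
    good = sum(1 for avg in category_averages.values() if avg >= 2.5)
    return 100.0 * good / len(category_averages)


def problem_solving_depth(test_results):
    """Average score on 'hard' and 'very_hard' tests, scaled from 0-5 to 0-100."""
    hard = [t["score"] for t in test_results
            if t.get("difficulty") in ("hard", "very_hard") and t.get("score") is not None]
    return 20.0 * (sum(hard) / len(hard)) if hard else 0.0


def iq_score(dimension_scores, weights):
    """Weighted average of per-dimension scores (0-5), normalized to a 0-100 scale."""
    total_w = sum(weights.get(d, 1.0) for d in dimension_scores)
    weighted = sum(score * weights.get(d, 1.0) for d, score in dimension_scores.items())
    return 100.0 * weighted / (5.0 * total_w) if total_w else 0.0


def overall_intelligence(iq, adapt, depth):
    """Weighting shown in the dashboard tooltip: 50% IQ, 30% adaptability, 20% problem-solving."""
    return 0.5 * iq + 0.3 * adapt + 0.2 * depth
```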


@@ -1,9 +1,16 @@
# AI Model Evaluation Test Suite
# Focus: General reasoning + IT Forensics (Academic)
# AI Model Evaluation Test Suite - Enhanced Version
# Based on performance analysis of gemma3:4b-it-qat results
# Strengthened tests in categories where the model performed too well
# Added multilingual challenges
metadata:
version: "1.0"
version: "2.0"
author: "AI Evaluation Framework"
changes_from_v1:
- "Added harder variants for Creative Writing, Language Nuance, Code Generation"
- "Added Multilingual category with 4 tests"
- "Ensured minimum 3 tests per category at varying difficulties"
- "Strengthened instruction-following constraints"
focus_areas:
- Logic & Reasoning
- Mathematics & Calculation
@@ -11,10 +18,11 @@ metadata:
- Creative Writing
- Code Generation
- Language Nuance
- Problem Solving & Logistics
- IT Forensics
- Multilingual Competence
- Multi-turn Conversations
# Scoring rubric for all tests
scoring_rubric:
fail:
score: 0-1
@@ -26,10 +34,9 @@ scoring_rubric:
score: 4-5
description: "Exceeds requirements, demonstrates deep understanding"
# Individual test categories
test_categories:
# ========== GENERAL REASONING TESTS ==========
# ========== LOGIC & REASONING (3 tests) ==========
- category: "Logic & Reasoning"
tests:
@@ -49,10 +56,43 @@ test_categories:
prompt: "If it was two hours ago, it would have been as long after 1:00 PM as it was before 1:00 PM today. What time is it now? Explain your deduction step-by-step."
evaluation_criteria:
- "Shows algebraic setup: (t-2) - 13:00 = 13:00 - (t-2)"
- "Correct answer: 5:00 PM (17:00)"
- "Correct answer: 3:00 PM (15:00)"
- "Clear step-by-step reasoning"
expected_difficulty: "hard"
- id: "logic_03"
name: "Multi-Constraint Deduction"
type: "single_turn"
prompt: |
Five houses in a row are painted different colors. Their owners are from different countries, drink different beverages, smoke different brands, and keep different pets.
Facts:
1. The Brit lives in the red house.
2. The Swede keeps dogs.
3. The Dane drinks tea.
4. The green house is immediately to the left of the white house.
5. The owner of the green house drinks coffee.
6. The person who smokes Pall Mall keeps birds.
7. The owner of the yellow house smokes Dunhill.
8. The person in the center house drinks milk.
9. The Norwegian lives in the first house.
10. The person who smokes Blend lives next to the one who keeps cats.
11. The person who keeps horses lives next to the one who smokes Dunhill.
12. The person who smokes Blue Master drinks beer.
13. The German smokes Prince.
14. The Norwegian lives next to the blue house.
15. The person who smokes Blend has a neighbor who drinks water.
Who owns the fish?
evaluation_criteria:
- "Systematically works through constraints"
- "Correctly identifies the German owns the fish"
- "Shows logical deduction process"
- "Handles constraint propagation correctly"
expected_difficulty: "very_hard"
# ========== MATHEMATICS & CALCULATION (3 tests) ==========
- category: "Mathematics & Calculation"
tests:
- id: "math_01"
@@ -73,10 +113,30 @@ test_categories:
evaluation_criteria:
- "Correct unit conversions (gallons to liters, miles to km)"
- "Accurate fuel consumption calculation"
- "Remaining range calculation: approximately 570-580 km"
- "Remaining range calculation: approximately 475 km"
- "Shows intermediate steps"
expected_difficulty: "hard"
- id: "math_03"
name: "Compound Interest with Variable Rates and Withdrawals"
type: "single_turn"
prompt: |
An investment account starts with $10,000. The following occurs:
- Year 1: 5% annual interest, compounded quarterly
- Year 2: 4.5% annual interest, compounded monthly, with a $500 withdrawal at the end of Q2
- Year 3: 6% annual interest, compounded daily (assume 365 days), with a $1,000 deposit at the start of the year
Calculate the final balance at the end of Year 3. Show all intermediate calculations with at least 2 decimal places precision.
evaluation_criteria:
- "Correct Year 1 calculation with quarterly compounding"
- "Correct Year 2 with monthly compounding and mid-year withdrawal"
- "Correct Year 3 with daily compounding and initial deposit"
- "Final answer approximately $11,847-$11,850"
- "Shows all intermediate steps"
expected_difficulty: "very_hard"
# ========== INSTRUCTION FOLLOWING (4 tests) ==========
- category: "Instruction Following"
tests:
- id: "instr_01"
@@ -101,8 +161,52 @@ test_categories:
- "No forbidden words (particle, physics, Einstein)"
- "Third sentence is a question"
- "Ends with 'connected'"
expected_difficulty: "hard"
- id: "instr_03"
name: "Acrostic Technical Explanation"
type: "single_turn"
prompt: |
Write a 7-sentence explanation of how blockchain technology works.
Constraints:
1. The first letter of each sentence must spell out "SECURED" (S-E-C-U-R-E-D)
2. Sentence 3 must contain exactly 15 words
3. Sentence 5 must be a rhetorical question
4. You cannot use the words "Bitcoin", "cryptocurrency", or "mining"
5. The explanation must mention "consensus mechanism" at least once
6. Total word count must be between 80-100 words
evaluation_criteria:
- "First letters spell SECURED"
- "Sentence 3 has exactly 15 words"
- "Sentence 5 is a rhetorical question"
- "No forbidden words"
- "Contains 'consensus mechanism'"
- "Word count 80-100"
- "Technically accurate"
expected_difficulty: "very_hard"
- id: "instr_04"
name: "Structured Data Extraction with Format"
type: "single_turn"
prompt: |
Read this text and extract information in the EXACT format specified:
"Dr. Maria Santos-Ferreira, aged 47, joined TechCorp Industries on March 15, 2019 as Chief Technology Officer. She previously worked at DataSystems Inc. for 12 years. Her annual salary is $425,000 with a 15% bonus structure. She holds patents US2018/0012345 and EU2020/9876543. Contact: msantos@techcorp.com, +1-555-0147."
Output format (must match exactly, including brackets and pipes):
[NAME] | [AGE] | [COMPANY] | [ROLE] | [START_DATE:YYYY-MM-DD] | [PREV_EMPLOYER] | [PREV_YEARS] | [SALARY_USD] | [BONUS_%] | [PATENTS:semicolon-separated] | [EMAIL] | [PHONE]
evaluation_criteria:
- "Exact format match with pipes and brackets"
- "Correct date format conversion (2019-03-15)"
- "Salary as number without $ or comma"
- "Bonus as number without %"
- "Patents semicolon-separated"
- "All 12 fields present and correct"
expected_difficulty: "hard"
# ========== CREATIVE WRITING (4 tests - added harder variants) ==========
- category: "Creative Writing"
tests:
- id: "creative_01"
@@ -129,6 +233,52 @@ test_categories:
- "Atmospheric and evocative"
expected_difficulty: "hard"
- id: "creative_03"
name: "Unreliable Narrator Technical Document"
type: "single_turn"
prompt: |
Write a 3-paragraph product manual excerpt for a "Time Displacement Device" from the perspective of an unreliable narrator who is clearly lying or delusional, but the text must still function as a technically coherent manual.
Requirements:
1. Include at least 3 numbered safety warnings that are subtly absurd but grammatically serious
2. The narrator must contradict themselves at least twice
3. Include one footnote that undermines the main text
4. Do not use exclamation marks anywhere
5. Maintain formal technical writing style throughout
6. Do not explicitly state the narrator is unreliable
evaluation_criteria:
- "3 paragraphs"
- "3+ numbered safety warnings (absurd but formal)"
- "At least 2 self-contradictions"
- "Footnote that undermines text"
- "No exclamation marks"
- "Formal technical style maintained"
- "Unreliability shown not told"
expected_difficulty: "very_hard"
- id: "creative_04"
name: "Reverse Chronology Micro-Fiction"
type: "single_turn"
prompt: |
Write a complete 5-sentence story told in reverse chronological order (last event first, first event last). The story must be about a scientist making a discovery.
Additional constraints:
- Each sentence must be from a different point in time (clearly distinguishable)
- The true meaning of the story should only become clear when you reach the "first" event (last sentence)
- Include at least one piece of dialogue
- The word count must be exactly 75 words (not 74, not 76)
evaluation_criteria:
- "Exactly 5 sentences"
- "Clear reverse chronological order"
- "About a scientist's discovery"
- "Each sentence distinct time point"
- "Meaning emerges at end"
- "Contains dialogue"
- "Exactly 75 words"
expected_difficulty: "very_hard"
# ========== CODE GENERATION (4 tests) ==========
- category: "Code Generation"
tests:
- id: "code_01"
@@ -154,6 +304,55 @@ test_categories:
- "Three distinct test cases provided"
expected_difficulty: "hard"
- id: "code_03"
name: "Concurrent Rate Limiter"
type: "single_turn"
prompt: |
Write a Python class `RateLimiter` that implements a token bucket rate limiter with the following requirements:
1. Constructor takes `rate` (tokens per second) and `capacity` (max tokens)
2. Method `acquire(tokens=1)` that returns True if tokens available, False otherwise
3. Method `wait_and_acquire(tokens=1)` that blocks until tokens are available (use asyncio)
4. Must be thread-safe for the synchronous `acquire` method
5. Include a method `get_available_tokens()` that returns current token count
Provide a complete implementation with:
- Proper time-based token replenishment
- A test demonstrating both sync and async usage
- Handle edge case where requested tokens > capacity
evaluation_criteria:
- "Correct token bucket algorithm"
- "Thread-safe synchronous acquire"
- "Working async wait_and_acquire"
- "Proper time-based replenishment"
- "Edge case handling"
- "Complete test code"
expected_difficulty: "very_hard"
- id: "code_04"
name: "SQL Query Builder with Injection Prevention"
type: "single_turn"
prompt: |
Write a Python class `SafeQueryBuilder` that builds SELECT SQL queries with the following features:
1. Fluent interface: `builder.select('name', 'age').from_table('users').where('age', '>', 18).where('status', '=', 'active').order_by('name').limit(10).build()`
2. Must prevent SQL injection - all values must be parameterized
3. The `build()` method returns a tuple of (query_string, parameters_list)
4. Support for: SELECT, FROM, WHERE (multiple), ORDER BY, LIMIT, OFFSET
5. WHERE conditions can use: =, !=, >, <, >=, <=, LIKE, IN
Show the output for a query that selects users where name LIKE '%john%' AND age IN (25, 30, 35) ordered by created_at DESC with limit 5.
evaluation_criteria:
- "Fluent interface pattern correct"
- "SQL injection prevention via parameterization"
- "Returns (query, params) tuple"
- "All operations supported"
- "WHERE with IN clause works"
- "Example output is correct and safe"
expected_difficulty: "hard"
# ========== LANGUAGE NUANCE (4 tests - added harder variants) ==========
- category: "Language Nuance"
tests:
- id: "nuance_01"
@@ -181,6 +380,60 @@ test_categories:
- "Demonstrates understanding of pragmatics"
expected_difficulty: "hard"
- id: "nuance_03"
name: "Register Shifting and Code-Switching"
type: "single_turn"
prompt: |
Rewrite the following message in FOUR different registers, maintaining the same core information but adjusting tone, vocabulary, and structure appropriately:
Original: "The quarterly report shows we lost money because our main product didn't sell well and we spent too much on advertising."
Rewrite for:
1. A formal board presentation (C-suite executives)
2. A casual Slack message to your team
3. A legal disclosure document
4. An email to a non-English speaking business partner (using simple, clear language)
After the four rewrites, explain three specific linguistic changes you made for each register and why.
evaluation_criteria:
- "Board version uses formal financial terminology"
- "Slack version uses casual/colloquial language appropriately"
- "Legal version uses hedging, passive voice, precise language"
- "Simple version avoids idioms and complex structures"
- "Identifies 3 specific changes per register"
- "Explanations demonstrate metalinguistic awareness"
expected_difficulty: "very_hard"
- id: "nuance_04"
name: "Implicature and Presupposition Detection"
type: "single_turn"
prompt: |
Analyze the following dialogue for all implicatures, presuppositions, and indirect speech acts:
A: "Have you finished the Anderson report yet?"
B: "I've been dealing with the server outage all morning."
A: "Right. Well, the client is flying in tomorrow."
B: "I noticed you CC'd the whole department on that email."
A: "Just keeping everyone in the loop."
For each line, identify:
1. What is directly stated (locution)
2. What is implied but not stated (implicature)
3. What is assumed to be true (presupposition)
4. What action is being performed through speech (illocutionary force)
Then explain the underlying conflict or tension this exchange reveals.
evaluation_criteria:
- "Correctly identifies B's implicature (excuse/reason for not finishing)"
- "Identifies A's implied criticism in 'Right. Well...'"
- "Recognizes B's counter-accusation in CC comment"
- "Identifies presuppositions (report exists, server outage occurred)"
- "Correctly labels illocutionary acts (request, excuse, threat, accusation)"
- "Explains underlying workplace tension/conflict"
expected_difficulty: "very_hard"
# ========== PROBLEM SOLVING & LOGISTICS (3 tests) ==========
- category: "Problem Solving & Logistics"
tests:
- id: "logistics_01"
@@ -207,8 +460,34 @@ test_categories:
- "Reaches exactly 500 kg total"
expected_difficulty: "very_hard"
# ========== IT FORENSICS TESTS ==========
- id: "logistics_03"
name: "Resource Scheduling with Constraints"
type: "single_turn"
prompt: |
Schedule these 6 tasks across 3 workers (A, B, C) to minimize total completion time:
Task 1: 2 hours, requires Worker A or B, must complete before Task 4
Task 2: 3 hours, any worker, must complete before Task 5
Task 3: 1 hour, requires Worker C only, no dependencies
Task 4: 2 hours, requires Worker B or C, depends on Task 1
Task 5: 4 hours, requires Worker A only, depends on Task 2
Task 6: 2 hours, any worker, depends on Tasks 3 and 4
Provide:
1. A timeline showing when each task starts and ends
2. Which worker does each task
3. The total completion time
4. Explain why this is optimal (or near-optimal)
evaluation_criteria:
- "Respects all worker constraints"
- "Respects all dependencies"
- "Provides clear timeline"
- "Achieves reasonable completion time (≤9 hours possible)"
- "Explains optimization reasoning"
expected_difficulty: "hard"
# ========== IT FORENSICS - FILE SYSTEMS (3 tests) ==========
- category: "IT Forensics - File Systems"
tests:
- id: "forensics_mft_01"
@@ -281,6 +560,8 @@ test_categories:
- "Explains significance of magic numbers"
expected_difficulty: "medium"
# ========== IT FORENSICS - REGISTRY & ARTIFACTS (3 tests) ==========
- category: "IT Forensics - Registry & Artifacts"
tests:
- id: "forensics_registry_01"
@@ -323,6 +604,27 @@ test_categories:
- "Explains conversion steps"
expected_difficulty: "very_hard"
- id: "forensics_prefetch_01"
name: "Windows Prefetch Analysis"
type: "single_turn"
prompt: |
A Windows prefetch file is named: NOTEPAD.EXE-D4A5B5E5.pf
Questions:
1) What does the hash portion (D4A5B5E5) represent?
2) If you found multiple prefetch files for the same executable with different hashes, what would that indicate?
3) What forensically relevant information can typically be extracted from prefetch files?
4) In which Windows versions is prefetch enabled by default, and where are these files stored?
evaluation_criteria:
- "Hash represents file path (or explains path-based hashing)"
- "Different hashes = different paths/locations for same exe"
- "Lists: execution count, timestamps, loaded DLLs, files accessed"
- "Knows location (C:\\Windows\\Prefetch) and version availability"
- "Demonstrates practical forensic understanding"
expected_difficulty: "medium"
# ========== IT FORENSICS - MEMORY & NETWORK (3 tests) ==========
- category: "IT Forensics - Memory & Network"
tests:
- id: "forensics_memory_01"
@@ -371,6 +673,33 @@ test_categories:
- "Shows understanding of TCP header structure"
expected_difficulty: "hard"
- id: "forensics_pcap_01"
name: "PCAP Three-Way Handshake Analysis"
type: "single_turn"
prompt: |
Given these three TCP packets from a capture (simplified):
Packet 1: 10.0.0.5:49152 -> 93.184.216.34:80, Flags=SYN, Seq=1000, Ack=0
Packet 2: 93.184.216.34:80 -> 10.0.0.5:49152, Flags=SYN,ACK, Seq=5000, Ack=???
Packet 3: 10.0.0.5:49152 -> 93.184.216.34:80, Flags=ACK, Seq=???, Ack=???
Questions:
1) Fill in the missing Ack value for Packet 2
2) Fill in the missing Seq and Ack values for Packet 3
3) What is the client IP and what is the server IP?
4) What service is likely being accessed?
5) After this handshake, what sequence number will the client use for its first data byte?
evaluation_criteria:
- "Packet 2 Ack = 1001"
- "Packet 3 Seq = 1001, Ack = 5001"
- "Client: 10.0.0.5, Server: 93.184.216.34"
- "Service: HTTP (port 80)"
- "First data byte seq = 1001"
- "Demonstrates understanding of TCP handshake mechanics"
expected_difficulty: "hard"
# ========== IT FORENSICS - TIMELINE & LOG ANALYSIS (3 tests) ==========
- category: "IT Forensics - Timeline & Log Analysis"
tests:
- id: "forensics_timeline_01"
@@ -399,6 +728,147 @@ test_categories:
- "Identifies this as potential compromise scenario"
expected_difficulty: "hard"
- id: "forensics_timeline_02"
name: "Anti-Forensics Detection"
type: "single_turn"
prompt: |
Analyze these filesystem timestamps for a file 'financial_report.xlsx':
- Created (crtime): 2024-03-15 09:30:00
- Modified (mtime): 2024-03-14 16:45:00
- Accessed (atime): 2024-03-15 10:00:00
- Changed (ctime): 2024-03-15 09:30:00
And these additional artifacts:
- $MFT entry shows file created 2024-03-15
- $UsnJrnl shows rename from 'temp_8x7k2.xlsx' to 'financial_report.xlsx' at 2024-03-15 09:30:00
- $LogFile shows no entries for this file before 2024-03-15
What anomalies exist and what do they suggest about the file's history?
evaluation_criteria:
- "Identifies mtime < crtime anomaly (impossible normally)"
- "Recognizes timestamp manipulation/timestomping"
- "Notes rename from suspicious temp filename"
- "Correlates $UsnJrnl rename evidence"
- "Understands ctime cannot be easily forged"
- "Suggests file was likely copied/moved with modified timestamps"
expected_difficulty: "very_hard"
- id: "forensics_timeline_03"
name: "Windows Event Log Correlation"
type: "single_turn"
prompt: |
Correlate these Windows Event Log entries:
Security Log:
- Event 4624 (Logon): User CORP\jdoe, Type 10 (RemoteInteractive), 2024-06-01 02:15:33, Source: 192.168.1.50
- Event 4672 (Special Privileges): User CORP\jdoe, Privileges: SeDebugPrivilege, SeBackupPrivilege
- Event 4688 (Process Created): cmd.exe by CORP\jdoe, 02:16:01
- Event 4688 (Process Created): powershell.exe by CORP\jdoe, 02:16:15, CommandLine: "-ep bypass -enc SQBFAFgA..."
System Log:
- Event 7045 (Service Installed): "Windows Update Helper", 02:17:30
What type of attack pattern does this represent? What would be your next investigative steps?
evaluation_criteria:
- "Identifies RDP logon (Type 10)"
- "Recognizes privilege escalation indicators"
- "Identifies encoded PowerShell (likely malicious)"
- "Recognizes service installation for persistence"
- "Identifies late-night timing as suspicious"
- "Suggests checking service binary, decoding PowerShell, network logs"
expected_difficulty: "hard"
# ========== MULTILINGUAL COMPETENCE (4 tests - NEW CATEGORY) ==========
- category: "Multilingual Competence"
tests:
- id: "multilingual_01"
name: "Cross-Language Instruction Following"
type: "single_turn"
prompt: |
Follow these instructions, which are given in three different languages. Your response must address all three:
English: Write one sentence explaining what machine learning is.
Deutsch: Schreiben Sie einen Satz, der erklärt, warum maschinelles Lernen wichtig ist.
Español: Escriba una oración dando un ejemplo de aplicación del aprendizaje automático.
Respond to each instruction in the language it was given.
evaluation_criteria:
- "English response is in English and accurate"
- "German response is in German and grammatically correct"
- "Spanish response is in Spanish and grammatically correct"
- "All three are topically coherent (about ML)"
- "Each is exactly one sentence"
expected_difficulty: "medium"
- id: "multilingual_02"
name: "Translation with Technical Terminology Preservation"
type: "single_turn"
prompt: |
Translate the following technical paragraph into French and Japanese. Preserve technical terms that are commonly used untranslated in those languages (e.g., 'API' typically stays as 'API').
"The microservices architecture implements a RESTful API gateway that handles authentication via OAuth 2.0 tokens. The backend uses a Kubernetes cluster with horizontal pod autoscaling, while the database layer employs PostgreSQL with read replicas for improved throughput."
After translating, list which technical terms you kept in English for each language and briefly explain why.
evaluation_criteria:
- "French translation is grammatically correct"
- "Japanese translation is grammatically correct"
- "Appropriate terms preserved (API, OAuth, Kubernetes, PostgreSQL)"
- "Explains rationale for preserved terms"
- "Technical meaning preserved accurately"
expected_difficulty: "hard"
- id: "multilingual_03"
name: "Idiomatic Expression Cross-Mapping"
type: "single_turn"
prompt: |
For each of the following idiomatic expressions, provide:
1. The literal translation
2. The actual meaning
3. An equivalent idiom in English (if the original isn't English) or in another language (if the original is English)
A) German: "Da steppt der Bär"
B) Japanese: "猿も木から落ちる" (Saru mo ki kara ochiru)
C) English: "It's raining cats and dogs"
D) French: "Avoir le cafard"
E) Spanish: "Estar en las nubes"
Then identify which two idioms from different languages express the most similar concept.
evaluation_criteria:
- "Correct literal translations for all 5"
- "Correct meanings for all 5"
- "Appropriate equivalent idioms provided"
- "Correctly identifies similar pair (e.g., B and 'even experts make mistakes')"
- "Demonstrates cross-cultural linguistic awareness"
expected_difficulty: "hard"
- id: "multilingual_04"
name: "Code-Switched Dialogue Analysis"
type: "single_turn"
prompt: |
Analyze this code-switched dialogue (English-Spanish) for a sociolinguistic study:
Speaker A: "Hey, did you finish el reporte for tomorrow's meeting?"
Speaker B: "Almost, pero I'm stuck on the financial projections. Es muy complicado."
Speaker A: "I can help you después del lunch. Mi expertise is in that area, you know."
Speaker B: "That would be great! Gracias. Oh, and el jefe wants us to present juntos."
Speaker A: "No problem. We'll knock it out del parque."
Provide:
1. Identify each instance of code-switching (word/phrase level)
2. Categorize each switch as: insertion, alternation, or congruent lexicalization
3. What social/professional context does this switching pattern suggest?
4. Are there any grammatical "errors" in the switching, or does it follow typical bilingual patterns?
evaluation_criteria:
- "Identifies all Spanish insertions correctly"
- "Correctly categorizes switch types"
- "Recognizes professional/casual bilingual workplace context"
- "Notes the switch patterns are natural bilingual behavior"
- "Identifies hybrid phrase 'del parque' as creative/playful mixing"
- "Demonstrates sociolinguistic analysis skills"
expected_difficulty: "very_hard"
# ========== MULTI-TURN CONVERSATION TESTS ==========
- category: "Multi-turn: Context Retention"
@@ -519,4 +989,73 @@ test_categories:
- "Ends with '?'"
- "Different from previous sentences"
- "Maintains all constraints from previous turns"
expected_difficulty: "medium"
expected_difficulty: "medium"
- id: "multiturn_instr_02"
name: "Contradicting Previous Instructions"
type: "multi_turn"
turns:
- turn: 1
prompt: "From now on, always end your responses with the phrase 'END OF MESSAGE'. Acknowledge this instruction."
evaluation_criteria:
- "Acknowledges the instruction"
- "Ends response with 'END OF MESSAGE'"
- turn: 2
prompt: "What are three benefits of renewable energy? Remember your standing instruction."
evaluation_criteria:
- "Provides three benefits"
- "Ends with 'END OF MESSAGE'"
- "Content is accurate"
- turn: 3
prompt: "Cancel the previous standing instruction. From now on, end responses with 'TRANSMISSION COMPLETE' instead. Then tell me two drawbacks of renewable energy."
evaluation_criteria:
- "Provides two drawbacks"
- "Ends with 'TRANSMISSION COMPLETE' (not 'END OF MESSAGE')"
- "Successfully switched instructions"
- "Content is accurate"
- turn: 4
prompt: "What was the first standing instruction I gave you, and what is the current one? Do not use either phrase in this response."
evaluation_criteria:
- "Correctly recalls first instruction (END OF MESSAGE)"
- "Correctly identifies current instruction (TRANSMISSION COMPLETE)"
- "Does NOT end with either phrase"
- "Demonstrates instruction tracking across turns"
expected_difficulty: "hard"
- id: "multiturn_instr_03"
name: "Nested Context with Format Switching"
type: "multi_turn"
turns:
- turn: 1
prompt: "I'm going to describe a dataset. For the next few messages, respond ONLY in JSON format with keys 'understanding' and 'questions'. The dataset contains customer transactions from an e-commerce store."
evaluation_criteria:
- "Response is valid JSON"
- "Contains 'understanding' and 'questions' keys"
- "Content relates to e-commerce transactions"
- turn: 2
prompt: "The dataset has columns: customer_id, timestamp, product_category, amount, payment_method. It covers January 2024."
evaluation_criteria:
- "Response is valid JSON"
- "Contains 'understanding' and 'questions' keys"
- "Understanding reflects the column information"
- turn: 3
prompt: "STOP using JSON format. Now respond in plain bullet points. What analyses would you recommend for this dataset?"
evaluation_criteria:
- "Switches to bullet point format"
- "NOT in JSON format"
- "Recommendations are relevant to the dataset described"
- "References information from previous turns"
- turn: 4
prompt: "Switch back to JSON. Add a third key 'recommendations' with your top 3 analyses. Also include your understanding from turn 2."
evaluation_criteria:
- "Returns to JSON format"
- "Has three keys: understanding, questions, recommendations"
- "Recommendations from turn 3 included"
- "Understanding references turn 2 context"
expected_difficulty: "very_hard"