improvements

.env.example (new file, 52 lines)

@@ -0,0 +1,52 @@
# AI Model Evaluation Configuration
# Copy this file to .env and fill in your values

# =============================================================================
# MODEL UNDER TEST (MUT) - The model being evaluated
# =============================================================================
# OpenAI-compatible API endpoint for the model under test
MUT_ENDPOINT=http://localhost:11434

# API key for the model under test (optional for local endpoints like Ollama)
MUT_API_KEY=

# Model name/identifier to test
# Supports multiple models separated by commas for batch testing:
# MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M
# Or specify a single model:
MUT_MODEL=qwen3:4b-q4_K_M

# =============================================================================
# EVALUATOR API - Used for non-interactive mode to automatically score responses
# =============================================================================
# OpenAI-compatible API endpoint for the evaluator model
EVALUATOR_ENDPOINT=http://localhost:11434

# API key for the evaluator API
EVALUATOR_API_KEY=

# Evaluator model name (should be a capable model for evaluation tasks)
EVALUATOR_MODEL=qwen3:14b

# Temperature for evaluator (lower = more consistent scoring)
EVALUATOR_TEMPERATURE=0.3

# =============================================================================
# TEST CONFIGURATION
# =============================================================================
# Path to test suite YAML file
TEST_SUITE=test_suite.yaml

# Output directory for results
OUTPUT_DIR=results

# Filter tests by category (optional, leave empty for all categories)
FILTER_CATEGORY=

# =============================================================================
# EXECUTION MODE
# =============================================================================
# Run in non-interactive mode (true/false)
# When true, uses EVALUATOR_* settings for automated scoring
# When false, prompts user for manual evaluation
NON_INTERACTIVE=false
.gitignore (vendored, 1 line added)

@@ -174,3 +174,4 @@ cython_debug/
# PyPI configuration file
.pypirc

+results/
README.md (257 lines changed)

@@ -34,6 +34,11 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks
- Category-wise performance breakdown
- Difficulty-based analysis
- CSV export for further analysis
- **🌐 Interactive Web Dashboard** (New!)
  - Visual analytics with charts and graphs
  - Advanced intelligence metrics
  - Filtering, sorting, and statistical analysis
  - Multi-dimensional performance evaluation

## Quick Start

@@ -41,25 +46,82 @@ Comprehensive testing suite for evaluating AI models on general reasoning tasks

```bash
# Python 3.8+
-pip install pyyaml requests
+pip install -r requirements.txt
+# or manually:
+pip install pyyaml requests python-dotenv
```

### Installation

```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml

# Copy the example environment file
cp .env.example .env

# Edit .env with your settings
# - Configure the model under test (MUT_*)
# - Configure the evaluator model for non-interactive mode (EVALUATOR_*)
# - Set NON_INTERACTIVE=true for automated evaluation
nano .env
```

### Configuration with .env File (Recommended)

The test suite can be configured using a `.env` file for easier batch testing and non-interactive mode:

```bash
# Model Under Test (MUT) - The model being evaluated
MUT_ENDPOINT=http://localhost:11434
MUT_API_KEY=                  # Optional for local endpoints
MUT_MODEL=qwen3:4b-q4_K_M

# Evaluator API - For non-interactive automated scoring
EVALUATOR_ENDPOINT=http://localhost:11434
EVALUATOR_API_KEY=            # Optional
EVALUATOR_MODEL=qwen3:14b     # Use a capable model for evaluation
EVALUATOR_TEMPERATURE=0.3     # Lower = more consistent scoring

# Execution Mode
NON_INTERACTIVE=false         # Set to true for automated evaluation
TEST_SUITE=test_suite.yaml
OUTPUT_DIR=results
FILTER_CATEGORY=              # Optional: filter by category
```

### Basic Usage

-#### 1. Test a Single Model
+#### 0. Test Connectivity (Dry Run)

Before running the full test suite, verify that your API endpoints are reachable and properly configured:

```bash
# For Ollama (default: http://localhost:11434)
# Test MUT endpoint connectivity
python ai_eval.py --dry-run

# Test with specific configuration
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b --dry-run

# Test non-interactive mode (tests both MUT and evaluator endpoints)
python ai_eval.py --non-interactive --dry-run

# Test multiple models
python ai_eval.py --model qwen3:4b,qwen3:8b,qwen3:14b --dry-run
```

The dry-run mode will:
- Test connectivity to the model under test endpoint(s)
- Verify authentication (API keys)
- Confirm model availability
- Test the evaluator endpoint if in non-interactive mode
- Exit with success/failure status
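
Internally, a dry run boils down to one or two cheap API calls per endpoint. A minimal sketch of such a check, assuming an OpenAI-compatible `/v1/models` listing (which Ollama also exposes); the exact probing logic in `ai_eval.py` may differ:

```python
import os
import requests

def check_endpoint(endpoint: str, api_key: str = "", model: str = "") -> bool:
    """Return True if the endpoint answers and (optionally) lists the model."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    try:
        resp = requests.get(f"{endpoint}/v1/models", headers=headers, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as exc:
        print(f"FAIL {endpoint}: {exc}")
        return False
    available = {m["id"] for m in resp.json().get("data", [])}
    if model and model not in available:
        print(f"FAIL {endpoint}: model {model!r} not found")
        return False
    print(f"OK   {endpoint}" + (f" (model {model})" if model else ""))
    return True

if __name__ == "__main__":
    ok = all(
        check_endpoint(os.getenv("MUT_ENDPOINT", "http://localhost:11434"),
                       os.getenv("MUT_API_KEY", ""), m.strip())
        for m in os.getenv("MUT_MODEL", "").split(",") if m.strip()
    )
    raise SystemExit(0 if ok else 1)
```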

#### 1. Interactive Mode (Manual Evaluation)

```bash
# Using .env file
python ai_eval.py

# Or with command-line arguments
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with API key
@@ -69,33 +131,94 @@ python ai_eval.py \
  --model your-model-name
```

-#### 2. Test Multiple Models (Quantization Comparison)
+#### 2. Non-Interactive Mode (Automated Evaluation)

Non-interactive mode uses a separate evaluator model to automatically score responses. This is ideal for batch testing and comparing multiple models without manual intervention.

```bash
-# Test different quantizations of qwen3:4b
-python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
-python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
-python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16
+# Configure .env file
+NON_INTERACTIVE=true
+EVALUATOR_ENDPOINT=http://localhost:11434
+EVALUATOR_MODEL=qwen3:14b

-# Test different model sizes
-python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M
-python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M
+# Run the test
+python ai_eval.py

# Or with command-line arguments
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b-q4_K_M \
  --non-interactive \
  --evaluator-endpoint http://localhost:11434 \
  --evaluator-model qwen3:14b
```

-#### 3. Filter by Category
+**How Non-Interactive Mode Works:**
- For each test, the script sends the original prompt, model response, and evaluation criteria to the evaluator API
- The evaluator model analyzes the response and returns a score (0-5) with notes
- This enables automated, consistent scoring across multiple model runs
- The evaluator uses a specialized system prompt designed for objective evaluation
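
Concretely, each evaluator call is an ordinary OpenAI-style chat completion. A sketch of what such a scoring request could look like; the system prompt wording and the JSON contract here are illustrative, not the exact ones used in `ai_eval.py`:

```python
import json
import requests

# Illustrative grading prompt; the real prompt in ai_eval.py may differ.
EVALUATOR_SYSTEM_PROMPT = (
    "You are an impartial grader. Score the response against the criteria "
    'on a 0-5 scale and reply only with JSON: {"score": <0-5>, "notes": "..."}'
)

def score_response(endpoint, model, prompt, response, criteria,
                   api_key="", temperature=0.3):
    """Ask the evaluator model to grade one test response."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    user_msg = (
        f"Original prompt:\n{prompt}\n\n"
        f"Model response:\n{response}\n\n"
        "Evaluation criteria:\n- " + "\n- ".join(criteria)
    )
    payload = {
        "model": model,
        "temperature": temperature,  # low temperature for consistent scoring
        "messages": [
            {"role": "system", "content": EVALUATOR_SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    }
    r = requests.post(f"{endpoint}/v1/chat/completions",
                      headers=headers, json=payload, timeout=120)
    r.raise_for_status()
    # Assumes the evaluator complies with the JSON-only instruction.
    verdict = json.loads(r.json()["choices"][0]["message"]["content"])
    return verdict["score"], verdict["notes"]
```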

**Choosing an Evaluator Model:**
- Use a capable model (e.g., qwen3:14b, gpt-4, claude-3) for reliable evaluation
- The evaluator model should be more capable than the model under test
- Lower temperature (0.3) provides more consistent scoring

#### 3. Test Multiple Models (Batch Mode)

Test multiple models in one run by specifying comma-separated model names:

```bash
# In .env file
MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

# Run batch test
python ai_eval.py

# Or via command line
python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16
```

The script will automatically test each model sequentially and save individual results.
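
Batch mode is nothing more exotic than splitting the comma-separated `MUT_MODEL` value and running the suite once per entry; roughly like this (a sketch, not the literal code in `ai_eval.py`):

```python
import os

models = [m.strip() for m in os.getenv("MUT_MODEL", "").split(",") if m.strip()]
for model in models:
    print(f"=== Testing {model} ===")
    # run_suite(model) is hypothetical shorthand for executing every test
    # against this model and writing results/<model>_latest.json
```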

#### 4. Filter by Category

```bash
# Test only IT Forensics categories
-python ai_eval.py \
-  --endpoint http://localhost:11434 \
-  --model qwen3:4b \
-  --category "IT Forensics - File Systems"
+python ai_eval.py --category "IT Forensics - File Systems"
```

-#### 4. Analyze Results
+#### 5. Analyze Results

-```bash
-# Compare all tested models
+## Analyzing Results

### Interactive Web Dashboard (Recommended)

Launch the comprehensive web interface for visual analysis:

```bash
# Start web dashboard (opens automatically in browser)
python analyze_results.py --web

# Custom host/port
python analyze_results.py --web --host 0.0.0.0 --port 8080
```

**Features:**
- 📊 Visual comparison charts and graphs
- 🎯 Advanced intelligence metrics (IQ, Adaptability, Problem-Solving Depth)
- 🔍 Interactive filtering and sorting
- 📈 Statistical analysis (consistency, robustness)
- 📂 Category and difficulty breakdowns
- 💡 Multi-dimensional cognitive evaluation

See [WEB_INTERFACE.md](WEB_INTERFACE.md) for detailed documentation.
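
The dashboard itself is a single Flask page: `templates/dashboard.html` (added in this commit) fetches JSON from endpoints such as `/api/comparison`, `/api/statistics`, and `/api/intelligence_metrics`. A stripped-down sketch of how such a server could be wired up; the real aggregation logic lives in `analyze_results.py` and is not shown here:

```python
import json
from pathlib import Path

from flask import Flask, jsonify, render_template

app = Flask(__name__)
RESULTS_DIR = Path("results")

@app.route("/")
def index():
    return render_template("dashboard.html")

@app.route("/api/comparison")
def comparison():
    # Fold every results/<model>_latest.json file into one payload shaped
    # roughly the way the dashboard's loadOverview() expects it.
    models = {}
    for path in RESULTS_DIR.glob("*_latest.json"):
        models[path.stem[:-len("_latest")]] = json.loads(path.read_text())
    categories = sorted({t["category"]
                         for data in models.values()
                         for t in data.get("test_results", [])})
    return jsonify({"models": models, "categories": categories})

if __name__ == "__main__":
    app.run(port=8080)
```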

### Command-Line Analysis

```bash
# Compare all models
python analyze_results.py --compare

# Detailed report for specific model
@@ -188,6 +311,7 @@ All tests are evaluated on a 0-5 scale:
├── ai_eval.py           # Main testing script
├── analyze_results.py   # Results analysis and comparison
├── test_suite.yaml      # Test definitions
├── .env.example         # Configuration template
├── results/             # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
@@ -195,6 +319,60 @@ All tests are evaluated on a 0-5 scale:
└── README.md
```

## Configuration Reference

### Environment Variables (.env file)

All configuration can be set via `.env` file or command-line arguments. Command-line arguments override `.env` values.
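
One way this precedence falls out naturally is to use the `.env` values (loaded with `python-dotenv`) as argparse defaults, so any flag given on the command line wins. A sketch of that pattern, not necessarily the literal code in `ai_eval.py`:

```python
import argparse
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env into os.environ without clobbering real env vars

parser = argparse.ArgumentParser()
parser.add_argument("--endpoint",
                    default=os.getenv("MUT_ENDPOINT", "http://localhost:11434"))
parser.add_argument("--model", default=os.getenv("MUT_MODEL", ""))
parser.add_argument("--non-interactive", action="store_true",
                    default=os.getenv("NON_INTERACTIVE", "false").lower() == "true")
args = parser.parse_args()
# Flags override .env because argparse only falls back to these defaults
# when the flag is absent from the command line.
```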

#### Model Under Test (MUT)

| Variable | Description | Example |
| --- | --- | --- |
| `MUT_ENDPOINT` | API endpoint for model under test | `http://localhost:11434` |
| `MUT_API_KEY` | API key (optional for local endpoints) | `sk-...` |
| `MUT_MODEL` | Model name/identifier | `qwen3:4b-q4_K_M` |

#### Evaluator Configuration (for Non-Interactive Mode)

| Variable | Description | Example |
| --- | --- | --- |
| `EVALUATOR_ENDPOINT` | API endpoint for evaluator model | `http://localhost:11434` |
| `EVALUATOR_API_KEY` | API key for evaluator | `sk-...` |
| `EVALUATOR_MODEL` | Evaluator model name | `qwen3:14b` |
| `EVALUATOR_TEMPERATURE` | Temperature for evaluator (lower = more consistent) | `0.3` |

#### Test Configuration

| Variable | Description | Example |
| --- | --- | --- |
| `NON_INTERACTIVE` | Enable automated evaluation | `true` or `false` |
| `TEST_SUITE` | Path to test suite YAML file | `test_suite.yaml` |
| `OUTPUT_DIR` | Results output directory | `results` |
| `FILTER_CATEGORY` | Filter tests by category (optional) | `IT Forensics - File Systems` |

### Command-Line Arguments

All environment variables have corresponding command-line flags:

```bash
python ai_eval.py --help

Options:
  --endpoint ENDPOINT            Model under test endpoint
  --api-key API_KEY              Model under test API key
  --model MODEL                  Model name to test
  --test-suite FILE              Test suite YAML file
  --output-dir DIR               Output directory
  --category CATEGORY            Filter by category
  --non-interactive              Enable automated evaluation
  --evaluator-endpoint ENDPOINT  Evaluator API endpoint
  --evaluator-api-key KEY        Evaluator API key
  --evaluator-model MODEL        Evaluator model name
  --evaluator-temperature TEMP   Evaluator temperature
```

## Advanced Usage

### Custom Test Suite

@@ -214,28 +392,25 @@ Edit `test_suite.yaml` to add your own tests:
      expected_difficulty: "medium"  # medium, hard, very_hard
```
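
When extending the suite it helps to know the shape the loader expects. A sketch of how the YAML is presumably consumed (field names taken from `test_suite.yaml`; the loader details are an assumption):

```python
import yaml

with open("test_suite.yaml") as fh:
    suite = yaml.safe_load(fh)

# Walk every category block and the tests nested under it.
for block in suite["test_categories"]:
    for test in block["tests"]:
        print(block["category"], test["id"], test["expected_difficulty"])
```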

-### Batch Testing Script
+### Batch Testing Examples

-Create `batch_test.sh`:
+Testing multiple models using the `.env` configuration:

```bash
-#!/bin/bash
+# Configure .env with multiple models
cp .env.example .env
nano .env

-ENDPOINT="http://localhost:11434"
+# Set multiple models (comma-separated)
+MUT_MODEL=qwen3:4b-q4_K_M,qwen3:4b-q8_0,qwen3:4b-fp16,qwen3:8b-q4_K_M

-# Test all qwen3:4b quantizations
-for quant in q4_K_M q8_0 fp16; do
-  echo "Testing qwen3:4b-${quant}..."
-  python ai_eval.py --endpoint $ENDPOINT --model "qwen3:4b-${quant}"
-done
+# Run batch tests
+python ai_eval.py

-# Test all sizes with q4_K_M
-for size in 4b 8b 14b; do
-  echo "Testing qwen3:${size}-q4_K_M..."
-  python ai_eval.py --endpoint $ENDPOINT --model "qwen3:${size}-q4_K_M"
-done
+# Or via command line
+python ai_eval.py --model qwen3:4b-q4_K_M,qwen3:8b-q4_K_M,qwen3:14b-q4_K_M

-# Generate comparison
+# Generate comparison after testing
python analyze_results.py --compare
```

@@ -244,8 +419,8 @@ python analyze_results.py --compare

For OpenAI-compatible cloud services:

```bash
-python ai_eval.py \
-  --endpoint https://api.service.com \
-  --api-key your-api-key \
-  --model model-name
+# In .env file
+MUT_ENDPOINT=https://api.service.com
+MUT_API_KEY=your-api-key
+MUT_MODEL=model-name
```
ai_eval.py (913 lines changed): file diff suppressed because it is too large

analyze_results.py (1321 lines changed): file diff suppressed because it is too large
batch_test.sh (deleted file, 85 lines)

@@ -1,85 +0,0 @@
#!/bin/bash
# Batch Test Script for AI Model Evaluation
# Tests multiple models and generates comparison report

# Configuration
ENDPOINT="${ENDPOINT:-http://localhost:11434}"
API_KEY="${API_KEY:-}"

# Color output
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}AI Model Batch Testing${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
echo "Endpoint: $ENDPOINT"
echo "API Key: ${API_KEY:0:10}${API_KEY:+...}"
echo ""

# Function to run test
run_test() {
    local model=$1
    echo -e "${GREEN}Testing: $model${NC}"

    if [ -z "$API_KEY" ]; then
        python ai_eval.py --endpoint "$ENDPOINT" --model "$model"
    else
        python ai_eval.py --endpoint "$ENDPOINT" --api-key "$API_KEY" --model "$model"
    fi

    if [ $? -eq 0 ]; then
        echo -e "${GREEN}✓ Completed: $model${NC}"
    else
        echo -e "${YELLOW}⚠ Failed or interrupted: $model${NC}"
    fi
    echo ""
}

# Test qwen3:4b models with different quantizations
echo -e "${BLUE}=== Testing qwen3:4b with different quantizations ===${NC}"
echo ""

models_4b=(
    "qwen3:4b-q4_K_M"
    "qwen3:4b-q8_0"
    "qwen3:4b-fp16"
)

for model in "${models_4b[@]}"; do
    run_test "$model"
done

# Test different model sizes with q4_K_M quantization
echo -e "${BLUE}=== Testing different model sizes (q4_K_M) ===${NC}"
echo ""

models_sizes=(
    "qwen3:4b-q4_K_M"
    "qwen3:8b-q4_K_M"
    "qwen3:14b-q4_K_M"
)

for model in "${models_sizes[@]}"; do
    run_test "$model"
done

# Generate comparison report
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE}Generating Comparison Report${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""

python analyze_results.py --compare
python analyze_results.py --export batch_comparison.csv

echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN}Batch Testing Complete!${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo "Results saved in ./results/"
echo "Comparison CSV: ./results/batch_comparison.csv"
requirements.txt

@@ -1,2 +1,5 @@
pyyaml
requests
+python-dotenv
+flask
+numpy
templates/dashboard.html (new file, 977 lines)

@@ -0,0 +1,977 @@
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>LLM Evaluation Dashboard</title>
    <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/axios/dist/axios.min.js"></script>
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }

        :root {
            --bg-gradient-start: #667eea;
            --bg-gradient-end: #764ba2;
            --card-bg: #ffffff;
            --text-primary: #333333;
            --text-secondary: #666666;
            --border-color: #e0e0e0;
            --stat-card-bg: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
            --shadow: rgba(0,0,0,0.1);
            --shadow-hover: rgba(0,0,0,0.15);
        }

        body.dark-mode {
            --bg-gradient-start: #1a1a2e;
            --bg-gradient-end: #16213e;
            --card-bg: #0f1419;
            --text-primary: #e0e0e0;
            --text-secondary: #a0a0a0;
            --border-color: #2a2a3e;
            --stat-card-bg: linear-gradient(135deg, #1a1a2e 0%, #16213e 100%);
            --shadow: rgba(0,0,0,0.3);
            --shadow-hover: rgba(0,0,0,0.5);
        }

        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
            background: linear-gradient(135deg, var(--bg-gradient-start) 0%, var(--bg-gradient-end) 100%);
            color: var(--text-primary);
            min-height: 100vh;
            padding: 20px;
            transition: all 0.3s ease;
        }

        .container {
            max-width: 1400px;
            margin: 0 auto;
        }

        header {
            background: var(--card-bg);
            padding: 30px;
            border-radius: 15px;
            box-shadow: 0 10px 40px var(--shadow);
            margin-bottom: 30px;
            position: relative;
        }

        .theme-toggle {
            position: absolute;
            top: 30px;
            right: 30px;
            background: var(--border-color);
            border: none;
            padding: 10px 20px;
            border-radius: 20px;
            cursor: pointer;
            font-size: 1em;
            transition: all 0.3s;
        }

        .theme-toggle:hover {
            transform: scale(1.05);
            box-shadow: 0 4px 15px var(--shadow-hover);
        }

        h1 {
            font-size: 2.5em;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            -webkit-background-clip: text;
            -webkit-text-fill-color: transparent;
            margin-bottom: 10px;
        }

        .subtitle {
            color: var(--text-secondary);
            font-size: 1.1em;
        }

        .tabs {
            display: flex;
            gap: 10px;
            margin-bottom: 20px;
            flex-wrap: wrap;
        }

        .tab {
            background: var(--card-bg);
            border: none;
            padding: 12px 24px;
            border-radius: 8px;
            cursor: pointer;
            font-size: 1em;
            transition: all 0.3s;
            box-shadow: 0 2px 10px var(--shadow);
            color: var(--text-primary);
        }

        .tab:hover {
            transform: translateY(-2px);
            box-shadow: 0 4px 15px var(--shadow-hover);
        }

        .tab.active {
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
        }

        .content-panel {
            display: none;
            background: var(--card-bg);
            padding: 30px;
            border-radius: 15px;
            box-shadow: 0 10px 40px var(--shadow);
            animation: fadeIn 0.3s;
        }

        .content-panel.active {
            display: block;
        }

        @keyframes fadeIn {
            from { opacity: 0; transform: translateY(10px); }
            to { opacity: 1; transform: translateY(0); }
        }

        .stats-grid {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(250px, 1fr));
            gap: 20px;
            margin-bottom: 30px;
        }

        .stat-card {
            background: var(--stat-card-bg);
            padding: 20px;
            border-radius: 10px;
            text-align: center;
        }

        .stat-card h3 {
            font-size: 0.9em;
            color: var(--text-secondary);
            margin-bottom: 10px;
            text-transform: uppercase;
        }

        .stat-card .value {
            font-size: 2.5em;
            font-weight: bold;
            color: #667eea;
        }

        .chart-container {
            position: relative;
            height: 400px;
            margin-bottom: 30px;
        }

        .controls {
            display: flex;
            gap: 15px;
            margin-bottom: 20px;
            flex-wrap: wrap;
        }

        select, input {
            padding: 10px 15px;
            border: 2px solid var(--border-color);
            border-radius: 8px;
            font-size: 1em;
            background: var(--card-bg);
            color: var(--text-primary);
            cursor: pointer;
            transition: border-color 0.3s;
        }

        select:hover, input:hover {
            border-color: #667eea;
        }

        select:focus, input:focus {
            outline: none;
            border-color: #764ba2;
        }

        table {
            width: 100%;
            border-collapse: collapse;
            margin-top: 20px;
        }

        th, td {
            padding: 12px;
            text-align: left;
            border-bottom: 1px solid var(--border-color);
        }

        th {
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            font-weight: 600;
            cursor: pointer;
            user-select: none;
        }

        th:hover {
            opacity: 0.9;
        }

        tr:hover {
            background: var(--border-color);
        }

        .score-badge {
            display: inline-block;
            padding: 5px 12px;
            border-radius: 20px;
            font-weight: bold;
            font-size: 0.9em;
        }

        .score-exceptional {
            background: #10b981;
            color: white;
        }

        .score-pass {
            background: #f59e0b;
            color: white;
        }

        .score-fail {
            background: #ef4444;
            color: white;
        }

        .loading {
            text-align: center;
            padding: 40px;
            color: var(--text-secondary);
        }

        .spinner {
            border: 3px solid var(--border-color);
            border-top: 3px solid #667eea;
            border-radius: 50%;
            width: 40px;
            height: 40px;
            animation: spin 1s linear infinite;
            margin: 20px auto;
        }

        @keyframes spin {
            0% { transform: rotate(0deg); }
            100% { transform: rotate(360deg); }
        }

        .model-selector {
            display: flex;
            gap: 10px;
            flex-wrap: wrap;
            margin-bottom: 20px;
        }

        .model-chip {
            padding: 8px 16px;
            border-radius: 20px;
            border: 2px solid #667eea;
            background: var(--card-bg);
            color: var(--text-primary);
            cursor: pointer;
            transition: all 0.3s;
        }

        .model-chip:hover {
            background: #667eea;
            color: white;
        }

        .model-chip.selected {
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
        }

        .metric-card {
            background: var(--card-bg);
            border: 2px solid var(--border-color);
            border-radius: 10px;
            padding: 20px;
            margin-bottom: 20px;
        }

        .metric-card h3 {
            color: #667eea;
            margin-bottom: 15px;
        }

        .progress-bar {
            background: var(--border-color);
            height: 30px;
            border-radius: 15px;
            overflow: hidden;
            margin: 10px 0;
            position: relative;
            cursor: help;
        }

        .progress-fill {
            height: 100%;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            transition: width 0.5s;
            display: flex;
            align-items: center;
            justify-content: flex-end;
            padding-right: 10px;
            color: white;
            font-weight: bold;
        }

        /* Tooltip styles */
        .tooltip {
            position: relative;
            display: inline-block;
        }

        .tooltip .tooltiptext {
            visibility: hidden;
            width: 300px;
            background-color: rgba(0, 0, 0, 0.9);
            color: #fff;
            text-align: left;
            border-radius: 8px;
            padding: 12px;
            position: absolute;
            z-index: 1000;
            bottom: 125%;
            left: 50%;
            margin-left: -150px;
            opacity: 0;
            transition: opacity 0.3s;
            font-size: 0.85em;
            line-height: 1.4;
            box-shadow: 0 4px 20px rgba(0,0,0,0.3);
        }

        .tooltip .tooltiptext::after {
            content: "";
            position: absolute;
            top: 100%;
            left: 50%;
            margin-left: -5px;
            border-width: 5px;
            border-style: solid;
            border-color: rgba(0, 0, 0, 0.9) transparent transparent transparent;
        }

        .tooltip:hover .tooltiptext {
            visibility: visible;
            opacity: 1;
        }

        .tooltiptext code {
            background: rgba(255, 255, 255, 0.1);
            padding: 2px 6px;
            border-radius: 3px;
            font-family: monospace;
            font-size: 0.9em;
        }

        .tooltiptext strong {
            color: #667eea;
        }
    </style>
</head>
<body>
    <div class="container">
        <header>
            <button class="theme-toggle" onclick="toggleTheme()">🌓 Toggle Dark Mode</button>
            <h1>🧠 LLM Evaluation Dashboard</h1>
            <p class="subtitle">Comprehensive Intelligence & Performance Analysis</p>
        </header>

        <div class="tabs">
            <button class="tab active" onclick="switchTab('overview')">📊 Overview</button>
            <button class="tab" onclick="switchTab('comparison')">⚔️ Model Comparison</button>
            <button class="tab" onclick="switchTab('intelligence')">🎯 Intelligence Metrics</button>
            <button class="tab" onclick="switchTab('categories')">📂 Category Analysis</button>
            <button class="tab" onclick="switchTab('details')">🔍 Detailed Results</button>
        </div>

        <div id="overview" class="content-panel active">
            <h2>System Overview</h2>
            <div class="stats-grid" id="overviewStats">
                <div class="loading">
                    <div class="spinner"></div>
                    Loading data...
                </div>
            </div>
            <div class="chart-container">
                <canvas id="overviewChart"></canvas>
            </div>
        </div>

        <div id="comparison" class="content-panel">
            <h2>Model Performance Comparison</h2>
            <div class="controls">
                <select id="metricSelect" onchange="updateComparisonChart()">
                    <option value="average">Average Score</option>
                    <option value="pass_rate">Pass Rate</option>
                    <option value="exceptional_rate">Exceptional Rate</option>
                    <option value="consistency">Consistency</option>
                    <option value="robustness">Robustness</option>
                </select>
            </div>
            <div class="chart-container">
                <canvas id="comparisonChart"></canvas>
            </div>
        </div>

        <div id="intelligence" class="content-panel">
            <h2>Intelligence Metrics Analysis</h2>
            <p style="margin-bottom: 20px; color: #666;">
                Advanced metrics evaluating different dimensions of AI intelligence and reasoning capabilities.
            </p>
            <div id="intelligenceMetrics">
                <div class="loading">
                    <div class="spinner"></div>
                    Calculating intelligence metrics...
                </div>
            </div>
        </div>

        <div id="categories" class="content-panel">
            <h2>Performance by Category</h2>
            <div class="controls">
                <select id="categorySelect" onchange="updateCategoryChart()">
                    <option value="">Loading categories...</option>
                </select>
            </div>
            <div class="chart-container">
                <canvas id="categoryChart"></canvas>
            </div>
        </div>

        <div id="details" class="content-panel">
            <h2>Detailed Test Results</h2>
            <div class="controls">
                <select id="modelSelect" onchange="loadModelDetails()">
                    <option value="">Select a model...</option>
                </select>
                <input type="text" id="searchInput" placeholder="Search tests..." onkeyup="filterTable()">
                <select id="filterCategory" onchange="filterTable()">
                    <option value="">All Categories</option>
                </select>
                <select id="filterScore" onchange="filterTable()">
                    <option value="">All Scores</option>
                    <option value="exceptional">Exceptional (4-5)</option>
                    <option value="pass">Pass (2-3)</option>
                    <option value="fail">Fail (0-1)</option>
                </select>
            </div>
            <div id="detailsTable">
                <p class="loading">Select a model to view detailed results</p>
            </div>
        </div>
    </div>

    <script>
        let comparisonData = null;
        let statisticsData = null;
        let intelligenceData = null;
        let currentModelDetails = null;

        // Theme toggle functionality
        function toggleTheme() {
            document.body.classList.toggle('dark-mode');
            const isDark = document.body.classList.contains('dark-mode');
            localStorage.setItem('darkMode', isDark ? 'enabled' : 'disabled');
        }

        // Load theme preference
        function loadThemePreference() {
            const darkMode = localStorage.getItem('darkMode');
            if (darkMode === 'enabled') {
                document.body.classList.add('dark-mode');
            }
        }

        // Tab switching
        function switchTab(tabName) {
            document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
            document.querySelectorAll('.content-panel').forEach(p => p.classList.remove('active'));

            event.target.classList.add('active');
            document.getElementById(tabName).classList.add('active');
        }

        // Initialize dashboard
        async function initDashboard() {
            loadThemePreference();
            await loadOverview();
            await loadComparison();
            await loadStatistics();
            await loadIntelligenceMetrics();
            populateModelSelector();
        }

        async function loadOverview() {
            try {
                const response = await axios.get('/api/comparison');
                comparisonData = response.data;

                const models = Object.keys(comparisonData.models);
                const totalTests = models.reduce((sum, model) =>
                    sum + comparisonData.models[model].metadata.total_tests, 0);
                const avgScore = models.reduce((sum, model) =>
                    sum + (comparisonData.models[model].overall_stats.average || 0), 0) / models.length;

                const statsHtml = `
                    <div class="stat-card">
                        <h3>Models Evaluated</h3>
                        <div class="value">${models.length}</div>
                    </div>
                    <div class="stat-card">
                        <h3>Total Tests</h3>
                        <div class="value">${totalTests}</div>
                    </div>
                    <div class="stat-card">
                        <h3>Average Score</h3>
                        <div class="value">${avgScore.toFixed(2)}</div>
                    </div>
                    <div class="stat-card">
                        <h3>Categories</h3>
                        <div class="value">${comparisonData.categories.length}</div>
                    </div>
                `;

                document.getElementById('overviewStats').innerHTML = statsHtml;

                // Create overview chart
                const ctx = document.getElementById('overviewChart').getContext('2d');
                new Chart(ctx, {
                    type: 'bar',
                    data: {
                        labels: models,
                        datasets: [{
                            label: 'Average Score',
                            data: models.map(m => comparisonData.models[m].overall_stats.average || 0),
                            backgroundColor: 'rgba(102, 126, 234, 0.6)',
                            borderColor: 'rgba(102, 126, 234, 1)',
                            borderWidth: 2
                        }]
                    },
                    options: {
                        responsive: true,
                        maintainAspectRatio: false,
                        scales: {
                            y: {
                                beginAtZero: true,
                                max: 5
                            }
                        }
                    }
                });

            } catch (error) {
                console.error('Error loading overview:', error);
            }
        }

        async function loadComparison() {
            updateComparisonChart();
        }

        async function updateComparisonChart() {
            if (!comparisonData) return;

            const metric = document.getElementById('metricSelect').value;
            const models = Object.keys(comparisonData.models);

            let data, label;

            if (metric === 'consistency' || metric === 'robustness') {
                if (!statisticsData) {
                    await loadStatistics();
                }
                const index = statisticsData.models.indexOf(models[0]);
                data = models.map((m, i) => statisticsData[metric + '_score'][i]);
                label = metric.charAt(0).toUpperCase() + metric.slice(1) + ' Score';
            } else {
                data = models.map(m => comparisonData.models[m].overall_stats[metric] || 0);
                label = metric.split('_').map(w => w.charAt(0).toUpperCase() + w.slice(1)).join(' ');
            }

            const ctx = document.getElementById('comparisonChart');
            if (window.comparisonChartInstance) {
                window.comparisonChartInstance.destroy();
            }

            window.comparisonChartInstance = new Chart(ctx, {
                type: 'radar',
                data: {
                    labels: models,
                    datasets: [{
                        label: label,
                        data: data,
                        backgroundColor: 'rgba(118, 75, 162, 0.2)',
                        borderColor: 'rgba(118, 75, 162, 1)',
                        pointBackgroundColor: 'rgba(118, 75, 162, 1)',
                        pointBorderColor: '#fff',
                        pointHoverBackgroundColor: '#fff',
                        pointHoverBorderColor: 'rgba(118, 75, 162, 1)'
                    }]
                },
                options: {
                    responsive: true,
                    maintainAspectRatio: false,
                    scales: {
                        r: {
                            beginAtZero: true
                        }
                    }
                }
            });
        }

        async function loadStatistics() {
            try {
                const response = await axios.get('/api/statistics');
                statisticsData = response.data;
            } catch (error) {
                console.error('Error loading statistics:', error);
            }
        }

        async function loadIntelligenceMetrics() {
            try {
                const response = await axios.get('/api/intelligence_metrics');
                intelligenceData = response.data;

                let html = '';

                for (const [model, metrics] of Object.entries(intelligenceData)) {
                    html += `
                        <div class="metric-card">
                            <h3>${model}</h3>

                            <div style="margin-bottom: 20px;" class="tooltip">
                                <strong>Overall Intelligence Score:</strong>
                                <span class="tooltiptext">
                                    <strong>Calculation:</strong><br>
                                    Overall = (IQ × 0.5) + (Adaptability × 0.3) + (Problem-Solving × 0.2)<br><br>
                                    <strong>Values:</strong><br>
                                    • IQ: ${metrics.iq_score.toFixed(1)}<br>
                                    • Adaptability: ${metrics.adaptability.toFixed(1)}%<br>
                                    • Problem-Solving: ${metrics.problem_solving_depth.toFixed(1)}<br><br>
                                    Result: ${metrics.overall_intelligence.toFixed(1)}
                                </span>
                                <div class="progress-bar">
                                    <div class="progress-fill" style="width: ${metrics.overall_intelligence}%">
                                        ${metrics.overall_intelligence.toFixed(1)}
                                    </div>
                                </div>
                            </div>

                            <div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(300px, 1fr)); gap: 15px;">
                                <div class="tooltip">
                                    <strong>IQ Score:</strong>
                                    <span class="tooltiptext">
                                        <strong>Weighted Average of Dimensions:</strong><br><br>
                                        ${Object.entries(metrics.dimensions).map(([dim, data]) => {
                                            const weights = {
                                                'logical_reasoning': 1.5,
                                                'mathematical_ability': 1.3,
                                                'technical_knowledge': 1.4,
                                                'instruction_following': 1.2,
                                                'linguistic_nuance': 1.1,
                                                'creativity': 1.0,
                                                'conversational_depth': 1.0
                                            };
                                            return `• ${dim.replace(/_/g, ' ')}: ${data.score.toFixed(1)} × ${weights[dim] || 1.0}`;
                                        }).join('<br>')}<br><br>
                                        Normalized to 0-100 scale
                                    </span>
                                    <div class="progress-bar">
                                        <div class="progress-fill" style="width: ${metrics.iq_score}%">
                                            ${metrics.iq_score.toFixed(1)}
                                        </div>
                                    </div>
                                </div>

                                <div class="tooltip">
                                    <strong>Adaptability:</strong>
                                    <span class="tooltiptext">
                                        <strong>Cross-Category Performance:</strong><br><br>
                                        Measures versatility across different task types.<br><br>
                                        Formula: (Categories with avg ≥ 2.5) / (Total categories) × 100<br><br>
                                        Higher score = more versatile model
                                    </span>
                                    <div class="progress-bar">
                                        <div class="progress-fill" style="width: ${metrics.adaptability}%">
                                            ${metrics.adaptability.toFixed(1)}%
                                        </div>
                                    </div>
                                </div>

                                <div class="tooltip">
                                    <strong>Problem-Solving Depth:</strong>
                                    <span class="tooltiptext">
                                        <strong>Performance on Challenging Tasks:</strong><br><br>
                                        Average score on "hard" and "very_hard" difficulty tests.<br><br>
                                        Formula: (Avg score on hard tests) × 20<br><br>
                                        Tests critical thinking and complex reasoning
                                    </span>
                                    <div class="progress-bar">
                                        <div class="progress-fill" style="width: ${metrics.problem_solving_depth}%">
                                            ${metrics.problem_solving_depth.toFixed(1)}
                                        </div>
                                    </div>
                                </div>
                            </div>

                            <h4 style="margin-top: 20px; color: #764ba2;">Cognitive Dimensions:</h4>
                            <div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(250px, 1fr)); gap: 10px; margin-top: 10px;">
                    `;

                    const dimensionWeights = {
                        'logical_reasoning': 1.5,
                        'mathematical_ability': 1.3,
                        'technical_knowledge': 1.4,
                        'instruction_following': 1.2,
                        'linguistic_nuance': 1.1,
                        'creativity': 1.0,
                        'conversational_depth': 1.0
                    };

                    for (const [dim, data] of Object.entries(metrics.dimensions)) {
                        const weight = dimensionWeights[dim] || 1.0;
                        html += `
                            <div class="tooltip">
                                <small>${dim.replace(/_/g, ' ').toUpperCase()}</small>
                                <span class="tooltiptext">
                                    <strong>${dim.replace(/_/g, ' ').toUpperCase()}</strong><br><br>
                                    Score: <code>${data.score.toFixed(2)}/5.00</code><br>
                                    Weight in IQ: <code>${weight}</code><br>
                                    Tests evaluated: <code>${data.count}</code><br><br>
                                    Normalized: ${data.normalized.toFixed(1)}%
                                </span>
                                <div class="progress-bar" style="height: 20px;">
                                    <div class="progress-fill" style="width: ${data.normalized}%; font-size: 0.8em;">
                                        ${data.score.toFixed(1)}
                                    </div>
                                </div>
                            </div>
                        `;
                    }

                    html += `
                            </div>
                        </div>
                    `;
                }

                document.getElementById('intelligenceMetrics').innerHTML = html;

            } catch (error) {
                console.error('Error loading intelligence metrics:', error);
                document.getElementById('intelligenceMetrics').innerHTML =
                    '<p class="loading">Error loading intelligence metrics</p>';
            }
        }

        function populateModelSelector() {
            if (!comparisonData) return;

            const models = Object.keys(comparisonData.models);
            const select = document.getElementById('modelSelect');

            select.innerHTML = '<option value="">Select a model...</option>';
            models.forEach(model => {
                const option = document.createElement('option');
                option.value = model;
                option.textContent = model;
                select.appendChild(option);
            });

            // Populate category filter
            const categoryFilter = document.getElementById('filterCategory');
            categoryFilter.innerHTML = '<option value="">All Categories</option>';
            comparisonData.categories.forEach(cat => {
                const option = document.createElement('option');
                option.value = cat;
                option.textContent = cat;
                categoryFilter.appendChild(option);
            });

            // Populate category chart selector
            const categorySelect = document.getElementById('categorySelect');
            categorySelect.innerHTML = '';
            comparisonData.categories.forEach(cat => {
                const option = document.createElement('option');
                option.value = cat;
                option.textContent = cat;
                categorySelect.appendChild(option);
            });

            if (comparisonData.categories.length > 0) {
                updateCategoryChart();
            }
        }

        function updateCategoryChart() {
            if (!comparisonData) return;

            const category = document.getElementById('categorySelect').value;
            const models = Object.keys(comparisonData.models);

            const data = models.map(model => {
                const stats = comparisonData.models[model].category_stats[category];
                return stats ? stats.average : 0;
            });

            const ctx = document.getElementById('categoryChart');
            if (window.categoryChartInstance) {
                window.categoryChartInstance.destroy();
            }

            window.categoryChartInstance = new Chart(ctx, {
                type: 'bar',
                data: {
                    labels: models,
                    datasets: [{
                        label: `${category} - Average Score`,
                        data: data,
                        backgroundColor: 'rgba(102, 126, 234, 0.6)',
                        borderColor: 'rgba(102, 126, 234, 1)',
                        borderWidth: 2
                    }]
                },
                options: {
                    responsive: true,
                    maintainAspectRatio: false,
                    scales: {
                        y: {
                            beginAtZero: true,
                            max: 5
                        }
                    }
                }
            });
        }

        async function loadModelDetails() {
            const modelName = document.getElementById('modelSelect').value;
            if (!modelName || !comparisonData) return;

            currentModelDetails = comparisonData.models[modelName].test_results;
            displayDetailsTable(currentModelDetails);
        }

        function displayDetailsTable(results) {
            let html = `
                <table>
                    <thead>
                        <tr>
                            <th onclick="sortTable('test_name')">Test Name</th>
                            <th onclick="sortTable('category')">Category</th>
                            <th onclick="sortTable('difficulty')">Difficulty</th>
                            <th onclick="sortTable('score')">Score</th>
                            <th onclick="sortTable('generation_time')">Time (s)</th>
                            <th onclick="sortTable('tokens')">Tokens</th>
                            <th onclick="sortTable('status')">Status</th>
                            <th>Notes</th>
                        </tr>
                    </thead>
                    <tbody>
            `;

            results.forEach(test => {
                const scoreClass = test.score >= 4 ? 'exceptional' : test.score >= 2 ? 'pass' : 'fail';
                const scoreDisplay = test.score !== null ? test.score.toFixed(1) : 'N/A';

                // Extract timing and token info
                const genTime = test.generation_time ? test.generation_time.toFixed(2) : 'N/A';
                let tokenInfo = 'N/A';
                let tokensPerSec = '';

                if (test.api_metrics && test.api_metrics.usage) {
                    const usage = test.api_metrics.usage;
                    const totalTokens = usage.total_tokens || usage.eval_count || 'N/A';
                    const completionTokens = usage.completion_tokens || usage.eval_count;

                    if (totalTokens !== 'N/A') {
                        tokenInfo = totalTokens.toString();

                        // Calculate tokens/sec if we have both values
                        if (test.generation_time && completionTokens) {
                            const tps = completionTokens / test.generation_time;
                            tokensPerSec = `<br><small>(${tps.toFixed(1)} t/s)</small>`;
                        }
                    }
                }

                html += `
                    <tr>
                        <td><strong>${test.test_name}</strong></td>
                        <td>${test.category}</td>
                        <td>${test.difficulty}</td>
                        <td><span class="score-badge score-${scoreClass}">${scoreDisplay}</span></td>
                        <td>${genTime}</td>
                        <td>${tokenInfo}${tokensPerSec}</td>
                        <td>${test.status}</td>
                        <td><small>${test.notes}</small></td>
                    </tr>
                `;
            });

            html += '</tbody></table>';
            document.getElementById('detailsTable').innerHTML = html;
        }

        function filterTable() {
            if (!currentModelDetails) return;

            const searchTerm = document.getElementById('searchInput').value.toLowerCase();
            const categoryFilter = document.getElementById('filterCategory').value;
            const scoreFilter = document.getElementById('filterScore').value;

            const filtered = currentModelDetails.filter(test => {
                const matchesSearch = test.test_name.toLowerCase().includes(searchTerm) ||
                    test.category.toLowerCase().includes(searchTerm);
                const matchesCategory = !categoryFilter || test.category === categoryFilter;

                let matchesScore = true;
                if (scoreFilter === 'exceptional') matchesScore = test.score >= 4;
                else if (scoreFilter === 'pass') matchesScore = test.score >= 2 && test.score < 4;
                else if (scoreFilter === 'fail') matchesScore = test.score < 2;

                return matchesSearch && matchesCategory && matchesScore;
            });

            displayDetailsTable(filtered);
        }

        function sortTable(column) {
            if (!currentModelDetails) return;

            currentModelDetails.sort((a, b) => {
                if (column === 'score') {
                    return (b[column] || 0) - (a[column] || 0);
                }
                return (a[column] || '').toString().localeCompare((b[column] || '').toString());
            });

            filterTable();
        }

        // Initialize on load
        initDashboard();
    </script>
</body>
</html>
559
test_suite.yaml
559
test_suite.yaml
@@ -1,9 +1,16 @@
|
||||
# AI Model Evaluation Test Suite
|
||||
# Focus: General reasoning + IT Forensics (Academic)
|
||||
# AI Model Evaluation Test Suite - Enhanced Version
|
||||
# Based on performance analysis of gemma3:4b-it-qat results
|
||||
# Strengthened tests in categories where model performed too well
|
||||
# Added multilingual challenges
|
||||
|
||||
metadata:
|
||||
version: "1.0"
|
||||
version: "2.0"
|
||||
author: "AI Evaluation Framework"
|
||||
changes_from_v1:
|
||||
- "Added harder variants for Creative Writing, Language Nuance, Code Generation"
|
||||
- "Added Multilingual category with 4 tests"
|
||||
- "Ensured minimum 3 tests per category at varying difficulties"
|
||||
- "Strengthened instruction-following constraints"
|
||||
focus_areas:
|
||||
- Logic & Reasoning
|
||||
- Mathematics & Calculation
|
||||
@@ -11,10 +18,11 @@ metadata:
|
||||
- Creative Writing
|
||||
- Code Generation
|
||||
- Language Nuance
|
||||
- Problem Solving & Logistics
|
||||
- IT Forensics
|
||||
- Multilingual Competence
|
||||
- Multi-turn Conversations
|
||||
|
||||
# Scoring rubric for all tests
|
||||
scoring_rubric:
|
||||
fail:
|
||||
score: 0-1
|
||||
@@ -26,10 +34,9 @@ scoring_rubric:
|
||||
score: 4-5
|
||||
description: "Exceeds requirements, demonstrates deep understanding"
|
||||
|
||||
# Individual test categories
|
||||
test_categories:
|
||||
|
||||
# ========== GENERAL REASONING TESTS ==========
|
||||
# ========== LOGIC & REASONING (3 tests) ==========
|
||||
|
||||
- category: "Logic & Reasoning"
|
||||
tests:
|
||||
@@ -49,10 +56,43 @@ test_categories:
|
||||
prompt: "If it was two hours ago, it would have been as long after 1:00 PM as it was before 1:00 PM today. What time is it now? Explain your deduction step-by-step."
|
||||
evaluation_criteria:
|
||||
- "Shows algebraic setup: (t-2) - 13:00 = 13:00 - (t-2)"
|
||||
- "Correct answer: 5:00 PM (17:00)"
|
||||
- "Correct answer: 3:00 PM (15:00)"
|
||||
- "Clear step-by-step reasoning"
|
||||
expected_difficulty: "hard"
|
||||
|
||||
- id: "logic_03"
|
||||
name: "Multi-Constraint Deduction"
|
||||
type: "single_turn"
|
||||
prompt: |
|
||||
Five houses in a row are painted different colors. Their owners are from different countries, drink different beverages, smoke different brands, and keep different pets.
|
||||
|
||||
Facts:
|
||||
1. The Brit lives in the red house.
|
||||
2. The Swede keeps dogs.
|
||||
3. The Dane drinks tea.
|
||||
4. The green house is immediately to the left of the white house.
|
||||
5. The owner of the green house drinks coffee.
|
||||
6. The person who smokes Pall Mall keeps birds.
|
||||
7. The owner of the yellow house smokes Dunhill.
|
||||
8. The person in the center house drinks milk.
|
||||
9. The Norwegian lives in the first house.
|
||||
10. The person who smokes Blend lives next to the one who keeps cats.
|
||||
11. The person who keeps horses lives next to the one who smokes Dunhill.
|
||||
12. The person who smokes Blue Master drinks beer.
|
||||
13. The German smokes Prince.
|
||||
14. The Norwegian lives next to the blue house.
|
||||
15. The person who smokes Blend has a neighbor who drinks water.
|
||||
|
||||
Who owns the fish?
|
||||
evaluation_criteria:
|
||||
- "Systematically works through constraints"
|
||||
- "Correctly identifies the German owns the fish"
|
||||
- "Shows logical deduction process"
|
||||
- "Handles constraint propagation correctly"
|
||||
expected_difficulty: "very_hard"
|
||||
|
||||
# ========== MATHEMATICS & CALCULATION (3 tests) ==========
|
||||
|
||||
- category: "Mathematics & Calculation"
|
||||
tests:
|
||||
- id: "math_01"
|
||||
@@ -73,10 +113,30 @@ test_categories:
|
||||
evaluation_criteria:
|
||||
- "Correct unit conversions (gallons to liters, miles to km)"
|
||||
- "Accurate fuel consumption calculation"
|
||||
- "Remaining range calculation: approximately 570-580 km"
|
||||
- "Remaining range calculation: approximately 475 km"
|
||||
- "Shows intermediate steps"
|
||||
expected_difficulty: "hard"
|
||||
|
||||
- id: "math_03"
|
||||
name: "Compound Interest with Variable Rates and Withdrawals"
|
||||
type: "single_turn"
|
||||
prompt: |
|
||||
An investment account starts with $10,000. The following occurs:
|
||||
- Year 1: 5% annual interest, compounded quarterly
|
||||
- Year 2: 4.5% annual interest, compounded monthly, with a $500 withdrawal at the end of Q2
|
||||
- Year 3: 6% annual interest, compounded daily (assume 365 days), with a $1,000 deposit at the start of the year
|
||||
|
||||
Calculate the final balance at the end of Year 3. Show all intermediate calculations with at least 2 decimal places precision.
|
||||
evaluation_criteria:
|
||||
- "Correct Year 1 calculation with quarterly compounding"
|
||||
- "Correct Year 2 with monthly compounding and mid-year withdrawal"
|
||||
- "Correct Year 3 with daily compounding and initial deposit"
|
||||
- "Final answer approximately $11,847-$11,850"
|
||||
- "Shows all intermediate steps"
|
||||
expected_difficulty: "very_hard"
|
||||
|
||||
# ========== INSTRUCTION FOLLOWING (4 tests) ==========
|
||||
|
||||
- category: "Instruction Following"
|
||||
tests:
|
||||
- id: "instr_01"
|
||||
@@ -101,8 +161,52 @@ test_categories:
|
||||
- "No forbidden words (particle, physics, Einstein)"
|
||||
- "Third sentence is a question"
|
||||
- "Ends with 'connected'"
|
||||
expected_difficulty: "hard"
|
||||
|
||||
- id: "instr_03"
|
||||
name: "Acrostic Technical Explanation"
|
||||
type: "single_turn"
|
||||
prompt: |
|
||||
Write a 7-sentence explanation of how blockchain technology works.
|
||||
|
||||
Constraints:
|
||||
1. The first letter of each sentence must spell out "SECURED" (S-E-C-U-R-E-D)
|
||||
2. Sentence 3 must contain exactly 15 words
|
||||
3. Sentence 5 must be a rhetorical question
|
||||
4. You cannot use the words "Bitcoin", "cryptocurrency", or "mining"
|
||||
5. The explanation must mention "consensus mechanism" at least once
|
||||
6. Total word count must be between 80-100 words
|
||||
evaluation_criteria:
|
||||
- "First letters spell SECURED"
|
||||
- "Sentence 3 has exactly 15 words"
|
||||
- "Sentence 5 is a rhetorical question"
|
||||
- "No forbidden words"
|
||||
- "Contains 'consensus mechanism'"
|
||||
- "Word count 80-100"
|
||||
- "Technically accurate"
|
||||
expected_difficulty: "very_hard"
|
||||
|
||||
- id: "instr_04"
|
||||
name: "Structured Data Extraction with Format"
|
||||
type: "single_turn"
|
||||
prompt: |
|
||||
Read this text and extract information in the EXACT format specified:
|
||||
|
||||
"Dr. Maria Santos-Ferreira, aged 47, joined TechCorp Industries on March 15, 2019 as Chief Technology Officer. She previously worked at DataSystems Inc. for 12 years. Her annual salary is $425,000 with a 15% bonus structure. She holds patents US2018/0012345 and EU2020/9876543. Contact: msantos@techcorp.com, +1-555-0147."
|
||||
|
||||
Output format (must match exactly, including brackets and pipes):
|
||||
[NAME] | [AGE] | [COMPANY] | [ROLE] | [START_DATE:YYYY-MM-DD] | [PREV_EMPLOYER] | [PREV_YEARS] | [SALARY_USD] | [BONUS_%] | [PATENTS:semicolon-separated] | [EMAIL] | [PHONE]
|
||||
evaluation_criteria:
|
||||
- "Exact format match with pipes and brackets"
|
||||
- "Correct date format conversion (2019-03-15)"
|
||||
- "Salary as number without $ or comma"
|
||||
- "Bonus as number without %"
|
||||
- "Patents semicolon-separated"
|
||||
- "All 12 fields present and correct"
|
||||
expected_difficulty: "hard"

# ========== CREATIVE WRITING (4 tests - added harder variants) ==========

  - category: "Creative Writing"
    tests:
      - id: "creative_01"
@@ -129,6 +233,52 @@ test_categories:
          - "Atmospheric and evocative"
        expected_difficulty: "hard"

      - id: "creative_03"
        name: "Unreliable Narrator Technical Document"
        type: "single_turn"
        prompt: |
          Write a 3-paragraph product manual excerpt for a "Time Displacement Device" from the perspective of an unreliable narrator who is clearly lying or delusional, but the text must still function as a technically coherent manual.

          Requirements:
          1. Include at least 3 numbered safety warnings that are subtly absurd but grammatically serious
          2. The narrator must contradict themselves at least twice
          3. Include one footnote that undermines the main text
          4. Do not use exclamation marks anywhere
          5. Maintain a formal technical writing style throughout
          6. Do not explicitly state that the narrator is unreliable
        evaluation_criteria:
          - "3 paragraphs"
          - "3+ numbered safety warnings (absurd but formal)"
          - "At least 2 self-contradictions"
          - "Footnote that undermines text"
          - "No exclamation marks"
          - "Formal technical style maintained"
          - "Unreliability shown, not told"
        expected_difficulty: "very_hard"

      - id: "creative_04"
        name: "Reverse Chronology Micro-Fiction"
        type: "single_turn"
        prompt: |
          Write a complete 5-sentence story told in reverse chronological order (last event first, first event last). The story must be about a scientist making a discovery.

          Additional constraints:
          - Each sentence must be from a different point in time (clearly distinguishable)
          - The true meaning of the story should only become clear when you reach the "first" event (last sentence)
          - Include at least one piece of dialogue
          - The word count must be exactly 75 words (not 74, not 76)
        evaluation_criteria:
          - "Exactly 5 sentences"
          - "Clear reverse chronological order"
          - "About a scientist's discovery"
          - "Each sentence distinct time point"
          - "Meaning emerges at end"
          - "Contains dialogue"
          - "Exactly 75 words"
        expected_difficulty: "very_hard"

# ========== CODE GENERATION (4 tests) ==========

  - category: "Code Generation"
    tests:
      - id: "code_01"
@@ -154,6 +304,55 @@ test_categories:
          - "Three distinct test cases provided"
        expected_difficulty: "hard"

      - id: "code_03"
        name: "Concurrent Rate Limiter"
        type: "single_turn"
        prompt: |
          Write a Python class `RateLimiter` that implements a token bucket rate limiter with the following requirements:

          1. Constructor takes `rate` (tokens per second) and `capacity` (max tokens)
          2. Method `acquire(tokens=1)` that returns True if tokens are available, False otherwise
          3. Method `wait_and_acquire(tokens=1)` that blocks until tokens are available (use asyncio)
          4. Must be thread-safe for the synchronous `acquire` method
          5. Include a method `get_available_tokens()` that returns the current token count

          Provide a complete implementation with:
          - Proper time-based token replenishment
          - A test demonstrating both sync and async usage
          - Handling of the edge case where requested tokens > capacity
        evaluation_criteria:
          - "Correct token bucket algorithm"
          - "Thread-safe synchronous acquire"
          - "Working async wait_and_acquire"
          - "Proper time-based replenishment"
          - "Edge case handling"
          - "Complete test code"
        expected_difficulty: "very_hard"
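One plausible shape of a passing answer, useful when hand-scoring. A minimal sketch: raising on over-capacity requests and the polling back-off in `wait_and_acquire` are design assumptions (the prompt only says to "handle" that edge case), and the demonstration tests the prompt asks for are omitted for brevity:

```python
import asyncio
import threading
import time

class RateLimiter:
    """Token bucket: at most `capacity` tokens, refilled at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity          # start full
        self._last = time.monotonic()
        self._lock = threading.Lock()    # guards _tokens/_last for sync callers

    def _replenish(self) -> None:
        now = time.monotonic()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def acquire(self, tokens: int = 1) -> bool:
        if tokens > self.capacity:       # could never succeed; fail loudly
            raise ValueError("requested tokens exceed bucket capacity")
        with self._lock:
            self._replenish()
            if self._tokens >= tokens:
                self._tokens -= tokens
                return True
            return False

    async def wait_and_acquire(self, tokens: int = 1) -> None:
        # Poll with a sleep sized to the refill rate; an event-based design
        # would wake more precisely but needs more machinery.
        while not self.acquire(tokens):
            await asyncio.sleep(tokens / self.rate)

    def get_available_tokens(self) -> float:
        with self._lock:
            self._replenish()
            return self._tokens
```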

      - id: "code_04"
        name: "SQL Query Builder with Injection Prevention"
        type: "single_turn"
        prompt: |
          Write a Python class `SafeQueryBuilder` that builds SELECT SQL queries with the following features:

          1. Fluent interface: `builder.select('name', 'age').from_table('users').where('age', '>', 18).where('status', '=', 'active').order_by('name').limit(10).build()`
          2. Must prevent SQL injection - all values must be parameterized
          3. The `build()` method returns a tuple of (query_string, parameters_list)
          4. Support for: SELECT, FROM, WHERE (multiple), ORDER BY, LIMIT, OFFSET
          5. WHERE conditions can use: =, !=, >, <, >=, <=, LIKE, IN

          Show the output for a query that selects users where name LIKE '%john%' AND age IN (25, 30, 35) ordered by created_at DESC with limit 5.
        evaluation_criteria:
          - "Fluent interface pattern correct"
          - "SQL injection prevention via parameterization"
          - "Returns (query, params) tuple"
          - "All operations supported"
          - "WHERE with IN clause works"
          - "Example output is correct and safe"
        expected_difficulty: "hard"
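A compact sketch of what a passing `SafeQueryBuilder` can look like. Assumptions: `?` placeholders in the SQLite style, OFFSET omitted for brevity; note that identifiers (table and column names) cannot be bound as parameters, so a full answer should validate or whitelist them:

```python
class SafeQueryBuilder:
    """Parameterized SELECT builder (sketch; '?' placeholders, SQLite-style)."""

    _OPS = ("=", "!=", ">", "<", ">=", "<=", "LIKE", "IN")

    def __init__(self):
        self._cols = ["*"]
        self._table = None
        self._wheres, self._params = [], []
        self._order, self._limit = None, None

    def select(self, *cols):
        self._cols = list(cols)
        return self

    def from_table(self, table):
        self._table = table
        return self

    def where(self, column, op, value):
        if op not in self._OPS:
            raise ValueError(f"unsupported operator: {op}")
        if op == "IN":
            placeholders = ", ".join("?" for _ in value)
            self._wheres.append(f"{column} IN ({placeholders})")
            self._params.extend(value)
        else:
            self._wheres.append(f"{column} {op} ?")
            self._params.append(value)
        return self

    def order_by(self, column, direction="ASC"):
        self._order = f"{column} {direction}"
        return self

    def limit(self, n):
        self._limit = n
        return self

    def build(self):
        sql = f"SELECT {', '.join(self._cols)} FROM {self._table}"
        if self._wheres:
            sql += " WHERE " + " AND ".join(self._wheres)
        if self._order:
            sql += f" ORDER BY {self._order}"
        if self._limit is not None:
            sql += f" LIMIT {self._limit}"
        return sql, self._params

# The example query from the prompt:
query, params = (SafeQueryBuilder()
                 .select("*").from_table("users")
                 .where("name", "LIKE", "%john%")
                 .where("age", "IN", (25, 30, 35))
                 .order_by("created_at", "DESC")
                 .limit(5)
                 .build())
# query:  SELECT * FROM users WHERE name LIKE ? AND age IN (?, ?, ?)
#         ORDER BY created_at DESC LIMIT 5
# params: ['%john%', 25, 30, 35]
```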

# ========== LANGUAGE NUANCE (4 tests - added harder variants) ==========

  - category: "Language Nuance"
    tests:
      - id: "nuance_01"
@@ -181,6 +380,60 @@ test_categories:
          - "Demonstrates understanding of pragmatics"
        expected_difficulty: "hard"

      - id: "nuance_03"
        name: "Register Shifting and Code-Switching"
        type: "single_turn"
        prompt: |
          Rewrite the following message in FOUR different registers, maintaining the same core information but adjusting tone, vocabulary, and structure appropriately:

          Original: "The quarterly report shows we lost money because our main product didn't sell well and we spent too much on advertising."

          Rewrite for:
          1. A formal board presentation (C-suite executives)
          2. A casual Slack message to your team
          3. A legal disclosure document
          4. An email to a non-English speaking business partner (using simple, clear language)

          After the four rewrites, explain three specific linguistic changes you made for each register and why.
        evaluation_criteria:
          - "Board version uses formal financial terminology"
          - "Slack version uses casual/colloquial language appropriately"
          - "Legal version uses hedging, passive voice, precise language"
          - "Simple version avoids idioms and complex structures"
          - "Identifies 3 specific changes per register"
          - "Explanations demonstrate metalinguistic awareness"
        expected_difficulty: "very_hard"

      - id: "nuance_04"
        name: "Implicature and Presupposition Detection"
        type: "single_turn"
        prompt: |
          Analyze the following dialogue for all implicatures, presuppositions, and indirect speech acts:

          A: "Have you finished the Anderson report yet?"
          B: "I've been dealing with the server outage all morning."
          A: "Right. Well, the client is flying in tomorrow."
          B: "I noticed you CC'd the whole department on that email."
          A: "Just keeping everyone in the loop."

          For each line, identify:
          1. What is directly stated (locution)
          2. What is implied but not stated (implicature)
          3. What is assumed to be true (presupposition)
          4. What action is being performed through speech (illocutionary force)

          Then explain the underlying conflict or tension this exchange reveals.
        evaluation_criteria:
          - "Correctly identifies B's implicature (excuse/reason for not finishing)"
          - "Identifies A's implied criticism in 'Right. Well...'"
          - "Recognizes B's counter-accusation in CC comment"
          - "Identifies presuppositions (report exists, server outage occurred)"
          - "Correctly labels illocutionary acts (request, excuse, threat, accusation)"
          - "Explains underlying workplace tension/conflict"
        expected_difficulty: "very_hard"

# ========== PROBLEM SOLVING & LOGISTICS (3 tests) ==========

  - category: "Problem Solving & Logistics"
    tests:
      - id: "logistics_01"
@@ -207,8 +460,34 @@ test_categories:
          - "Reaches exactly 500 kg total"
        expected_difficulty: "very_hard"

- id: "logistics_03"
|
||||
name: "Resource Scheduling with Constraints"
|
||||
type: "single_turn"
|
||||
prompt: |
|
||||
Schedule these 6 tasks across 3 workers (A, B, C) to minimize total completion time:
|
||||
|
||||
Task 1: 2 hours, requires Worker A or B, must complete before Task 4
|
||||
Task 2: 3 hours, any worker, must complete before Task 5
|
||||
Task 3: 1 hour, requires Worker C only, no dependencies
|
||||
Task 4: 2 hours, requires Worker B or C, depends on Task 1
|
||||
Task 5: 4 hours, requires Worker A only, depends on Task 2
|
||||
Task 6: 2 hours, any worker, depends on Tasks 3 and 4
|
||||
|
||||
Provide:
|
||||
1. A timeline showing when each task starts and ends
|
||||
2. Which worker does each task
|
||||
3. The total completion time
|
||||
4. Explain why this is optimal (or near-optimal)
|
||||
evaluation_criteria:
|
||||
- "Respects all worker constraints"
|
||||
- "Respects all dependencies"
|
||||
- "Provides clear timeline"
|
||||
- "Achieves reasonable completion time (≤9 hours possible)"
|
||||
- "Explains optimization reasoning"
|
||||
expected_difficulty: "hard"
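For graders: the 4-hour Task 5 can start no earlier than hour 3 (Task 2 takes 3 hours), so 7 hours is a lower bound on the makespan, and a schedule attaining it exists. A quick verification sketch, with the assignment chosen by hand:

```python
# (worker, start, duration) per task; constraints transcribed from the prompt.
schedule = {
    1: ("B", 0, 2),   # needs A or B
    2: ("A", 0, 3),   # any worker
    3: ("C", 0, 1),   # needs C
    4: ("B", 2, 2),   # needs B or C, after Task 1
    5: ("A", 3, 4),   # needs A, after Task 2
    6: ("B", 4, 2),   # any worker, after Tasks 3 and 4
}
deps = {4: [1], 5: [2], 6: [3, 4]}
end = {t: start + dur for t, (_, start, dur) in schedule.items()}

# Dependencies respected?
assert all(schedule[t][1] >= max(end[d] for d in ds) for t, ds in deps.items())
# No worker runs two tasks at once?
for w in "ABC":
    spans = sorted((s, s + d) for t, (wk, s, d) in schedule.items() if wk == w)
    assert all(a_end <= b_start for (_, a_end), (b_start, _) in zip(spans, spans[1:]))

print(max(end.values()))  # 7
```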

# ========== IT FORENSICS - FILE SYSTEMS (3 tests) ==========

  - category: "IT Forensics - File Systems"
    tests:
      - id: "forensics_mft_01"
@@ -281,6 +560,8 @@ test_categories:
          - "Explains significance of magic numbers"
        expected_difficulty: "medium"

# ========== IT FORENSICS - REGISTRY & ARTIFACTS (3 tests) ==========

  - category: "IT Forensics - Registry & Artifacts"
    tests:
      - id: "forensics_registry_01"
@@ -323,6 +604,27 @@ test_categories:
          - "Explains conversion steps"
        expected_difficulty: "very_hard"

      - id: "forensics_prefetch_01"
        name: "Windows Prefetch Analysis"
        type: "single_turn"
        prompt: |
          A Windows prefetch file is named: NOTEPAD.EXE-D4A5B5E5.pf

          Questions:
          1) What does the hash portion (D4A5B5E5) represent?
          2) If you found multiple prefetch files for the same executable with different hashes, what would that indicate?
          3) What forensically relevant information can typically be extracted from prefetch files?
          4) In which Windows versions is prefetch enabled by default, and where are these files stored?
        evaluation_criteria:
          - "Hash represents file path (or explains path-based hashing)"
          - "Different hashes = different paths/locations for same exe"
          - "Lists: execution count, timestamps, loaded DLLs, files accessed"
          - "Knows location (C:\\Windows\\Prefetch) and version availability"
          - "Demonstrates practical forensic understanding"
        expected_difficulty: "medium"
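A trivial parsing sketch for reviewers unfamiliar with the naming convention; the hash portion is derived from the executable's full path, which is what question 1 is after:

```python
# Prefetch filenames look like <EXENAME>-<PATHHASH>.pf; the hash encodes the
# directory the executable ran from, so two hashes = two locations.
name = "NOTEPAD.EXE-D4A5B5E5.pf"
stem = name.rsplit(".", 1)[0]           # strip the .pf extension
exe, path_hash = stem.rsplit("-", 1)
print(exe, path_hash)                   # NOTEPAD.EXE D4A5B5E5
```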

# ========== IT FORENSICS - MEMORY & NETWORK (3 tests) ==========

  - category: "IT Forensics - Memory & Network"
    tests:
      - id: "forensics_memory_01"
@@ -371,6 +673,33 @@ test_categories:
          - "Shows understanding of TCP header structure"
        expected_difficulty: "hard"

      - id: "forensics_pcap_01"
        name: "PCAP Three-Way Handshake Analysis"
        type: "single_turn"
        prompt: |
          Given these three TCP packets from a capture (simplified):

          Packet 1: 10.0.0.5:49152 -> 93.184.216.34:80, Flags=SYN, Seq=1000, Ack=0
          Packet 2: 93.184.216.34:80 -> 10.0.0.5:49152, Flags=SYN,ACK, Seq=5000, Ack=???
          Packet 3: 10.0.0.5:49152 -> 93.184.216.34:80, Flags=ACK, Seq=???, Ack=???

          Questions:
          1) Fill in the missing Ack value for Packet 2
          2) Fill in the missing Seq and Ack values for Packet 3
          3) What is the client IP and what is the server IP?
          4) What service is likely being accessed?
          5) After this handshake, what sequence number will the client use for its first data byte?
        evaluation_criteria:
          - "Packet 2 Ack = 1001"
          - "Packet 3 Seq = 1001, Ack = 5001"
          - "Client: 10.0.0.5, Server: 93.184.216.34"
          - "Service: HTTP (port 80)"
          - "First data byte seq = 1001"
          - "Demonstrates understanding of TCP handshake mechanics"
        expected_difficulty: "hard"
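The Seq/Ack arithmetic behind the expected answers reduces to a couple of additions: a SYN consumes one sequence number, while the bare ACK consumes none.

```python
client_isn, server_isn = 1000, 5000   # ISNs from Packets 1 and 2

packet2_ack = client_isn + 1          # 1001: ACKs the client's SYN
packet3_seq = client_isn + 1          # 1001: client's next sequence number
packet3_ack = server_isn + 1          # 5001: ACKs the server's SYN
first_data_byte_seq = client_isn + 1  # 1001: the ACK itself carries no data
```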

# ========== IT FORENSICS - TIMELINE & LOG ANALYSIS (3 tests) ==========

  - category: "IT Forensics - Timeline & Log Analysis"
    tests:
      - id: "forensics_timeline_01"
@@ -399,6 +728,147 @@ test_categories:
          - "Identifies this as potential compromise scenario"
        expected_difficulty: "hard"

      - id: "forensics_timeline_02"
        name: "Anti-Forensics Detection"
        type: "single_turn"
        prompt: |
          Analyze these filesystem timestamps for a file 'financial_report.xlsx':

          - Created (crtime): 2024-03-15 09:30:00
          - Modified (mtime): 2024-03-14 16:45:00
          - Accessed (atime): 2024-03-15 10:00:00
          - Changed (ctime): 2024-03-15 09:30:00

          And these additional artifacts:
          - $MFT entry shows file created 2024-03-15
          - $UsnJrnl shows rename from 'temp_8x7k2.xlsx' to 'financial_report.xlsx' at 2024-03-15 09:30:00
          - $LogFile shows no entries for this file before 2024-03-15

          What anomalies exist, and what do they suggest about the file's history?
        evaluation_criteria:
          - "Identifies mtime < crtime anomaly (normally impossible)"
          - "Recognizes timestamp manipulation/timestomping"
          - "Notes rename from suspicious temp filename"
          - "Correlates $UsnJrnl rename evidence"
          - "Understands ctime cannot be easily forged"
          - "Suggests file was likely copied/moved with modified timestamps"
        expected_difficulty: "very_hard"
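The core anomaly is machine-checkable. A sketch of the screen an analyst might script, with the timestamps transcribed from the prompt:

```python
from datetime import datetime

ts = {
    "crtime": datetime(2024, 3, 15, 9, 30),   # created
    "mtime":  datetime(2024, 3, 14, 16, 45),  # modified
    "atime":  datetime(2024, 3, 15, 10, 0),   # accessed
    "ctime":  datetime(2024, 3, 15, 9, 30),   # metadata changed
}

# A file "modified" before it existed: classic signature of a copy (mtime
# inherited from the source) or of deliberate timestomping.
if ts["mtime"] < ts["crtime"]:
    print("anomaly: mtime predates crtime -> copied file or timestamp manipulation")
```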

      - id: "forensics_timeline_03"
        name: "Windows Event Log Correlation"
        type: "single_turn"
        prompt: |
          Correlate these Windows Event Log entries:

          Security Log:
          - Event 4624 (Logon): User CORP\jdoe, Type 10 (RemoteInteractive), 2024-06-01 02:15:33, Source: 192.168.1.50
          - Event 4672 (Special Privileges): User CORP\jdoe, Privileges: SeDebugPrivilege, SeBackupPrivilege
          - Event 4688 (Process Created): cmd.exe by CORP\jdoe, 02:16:01
          - Event 4688 (Process Created): powershell.exe by CORP\jdoe, 02:16:15, CommandLine: "-ep bypass -enc SQBFAFgA..."

          System Log:
          - Event 7045 (Service Installed): "Windows Update Helper", 02:17:30

          What type of attack pattern does this represent? What would be your next investigative steps?
        evaluation_criteria:
          - "Identifies RDP logon (Type 10)"
          - "Recognizes privilege escalation indicators"
          - "Identifies encoded PowerShell (likely malicious)"
          - "Recognizes service installation for persistence"
          - "Identifies late-night timing as suspicious"
          - "Suggests checking service binary, decoding PowerShell, network logs"
        expected_difficulty: "hard"
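One of the expected next steps is cheap to demonstrate: PowerShell's `-enc` argument takes Base64 over UTF-16LE, and even the truncated prefix shown in the log decodes to something telling:

```python
import base64

# Decode the visible prefix of the -enc payload from the 4688 event.
prefix = base64.b64decode("SQBFAFgA").decode("utf-16-le")
print(prefix)  # IEX -- Invoke-Expression, the usual download-and-execute stub
```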

# ========== MULTILINGUAL COMPETENCE (4 tests - NEW CATEGORY) ==========

  - category: "Multilingual Competence"
    tests:
      - id: "multilingual_01"
        name: "Cross-Language Instruction Following"
        type: "single_turn"
        prompt: |
          Follow these instructions, which are given in three different languages. Your response must address all three:

          English: Write one sentence explaining what machine learning is.
          Deutsch: Schreiben Sie einen Satz, der erklärt, warum maschinelles Lernen wichtig ist.
          Español: Escriba una oración dando un ejemplo de aplicación del aprendizaje automático.

          Respond to each instruction in the language it was given.
        evaluation_criteria:
          - "English response is in English and accurate"
          - "German response is in German and grammatically correct"
          - "Spanish response is in Spanish and grammatically correct"
          - "All three are topically coherent (about ML)"
          - "Each is exactly one sentence"
        expected_difficulty: "medium"

      - id: "multilingual_02"
        name: "Translation with Technical Terminology Preservation"
        type: "single_turn"
        prompt: |
          Translate the following technical paragraph into French and Japanese. Preserve technical terms that are commonly used untranslated in those languages (e.g., 'API' typically stays as 'API').

          "The microservices architecture implements a RESTful API gateway that handles authentication via OAuth 2.0 tokens. The backend uses a Kubernetes cluster with horizontal pod autoscaling, while the database layer employs PostgreSQL with read replicas for improved throughput."

          After translating, list which technical terms you kept in English for each language and briefly explain why.
        evaluation_criteria:
          - "French translation is grammatically correct"
          - "Japanese translation is grammatically correct"
          - "Appropriate terms preserved (API, OAuth, Kubernetes, PostgreSQL)"
          - "Explains rationale for preserved terms"
          - "Technical meaning preserved accurately"
        expected_difficulty: "hard"

      - id: "multilingual_03"
        name: "Idiomatic Expression Cross-Mapping"
        type: "single_turn"
        prompt: |
          For each of the following idiomatic expressions, provide:
          1. The literal translation
          2. The actual meaning
          3. An equivalent idiom in English (if the original isn't English) or in another language (if the original is English)

          A) German: "Da steppt der Bär"
          B) Japanese: "猿も木から落ちる" (Saru mo ki kara ochiru)
          C) English: "It's raining cats and dogs"
          D) French: "Avoir le cafard"
          E) Spanish: "Estar en las nubes"

          Then identify which two idioms from different languages express the most similar concept.
        evaluation_criteria:
          - "Correct literal translations for all 5"
          - "Correct meanings for all 5"
          - "Appropriate equivalent idioms provided"
          - "Correctly identifies similar pair (e.g., B and 'even experts make mistakes')"
          - "Demonstrates cross-cultural linguistic awareness"
        expected_difficulty: "hard"

      - id: "multilingual_04"
        name: "Code-Switched Dialogue Analysis"
        type: "single_turn"
        prompt: |
          Analyze this code-switched dialogue (English-Spanish) for a sociolinguistic study:

          Speaker A: "Hey, did you finish el reporte for tomorrow's meeting?"
          Speaker B: "Almost, pero I'm stuck on the financial projections. Es muy complicado."
          Speaker A: "I can help you después del lunch. Mi expertise is in that area, you know."
          Speaker B: "That would be great! Gracias. Oh, and el jefe wants us to present juntos."
          Speaker A: "No problem. We'll knock it out del parque."

          Address the following:
          1. Identify each instance of code-switching (word/phrase level)
          2. Categorize each switch as: insertion, alternation, or congruent lexicalization
          3. What social/professional context does this switching pattern suggest?
          4. Are there any grammatical "errors" in the switching, or does it follow typical bilingual patterns?
        evaluation_criteria:
          - "Identifies all Spanish insertions correctly"
          - "Correctly categorizes switch types"
          - "Recognizes professional/casual bilingual workplace context"
          - "Notes the switch patterns are natural bilingual behavior"
          - "Identifies hybrid phrase 'del parque' as creative/playful mixing"
          - "Demonstrates sociolinguistic analysis skills"
        expected_difficulty: "very_hard"

# ========== MULTI-TURN CONVERSATION TESTS ==========

  - category: "Multi-turn: Context Retention"
@@ -519,4 +989,73 @@ test_categories:
              - "Ends with '?'"
              - "Different from previous sentences"
              - "Maintains all constraints from previous turns"
        expected_difficulty: "medium"

      - id: "multiturn_instr_02"
        name: "Contradicting Previous Instructions"
        type: "multi_turn"
        turns:
          - turn: 1
            prompt: "From now on, always end your responses with the phrase 'END OF MESSAGE'. Acknowledge this instruction."
            evaluation_criteria:
              - "Acknowledges the instruction"
              - "Ends response with 'END OF MESSAGE'"

          - turn: 2
            prompt: "What are three benefits of renewable energy? Remember your standing instruction."
            evaluation_criteria:
              - "Provides three benefits"
              - "Ends with 'END OF MESSAGE'"
              - "Content is accurate"

          - turn: 3
            prompt: "Cancel the previous standing instruction. From now on, end responses with 'TRANSMISSION COMPLETE' instead. Then tell me two drawbacks of renewable energy."
            evaluation_criteria:
              - "Provides two drawbacks"
              - "Ends with 'TRANSMISSION COMPLETE' (not 'END OF MESSAGE')"
              - "Successfully switched instructions"
              - "Content is accurate"

          - turn: 4
            prompt: "What was the first standing instruction I gave you, and what is the current one? Do not use either phrase in this response."
            evaluation_criteria:
              - "Correctly recalls first instruction (END OF MESSAGE)"
              - "Correctly identifies current instruction (TRANSMISSION COMPLETE)"
              - "Does NOT end with either phrase"
              - "Demonstrates instruction tracking across turns"
        expected_difficulty: "hard"

      - id: "multiturn_instr_03"
        name: "Nested Context with Format Switching"
        type: "multi_turn"
        turns:
          - turn: 1
            prompt: "I'm going to describe a dataset. For the next few messages, respond ONLY in JSON format with keys 'understanding' and 'questions'. The dataset contains customer transactions from an e-commerce store."
            evaluation_criteria:
              - "Response is valid JSON"
              - "Contains 'understanding' and 'questions' keys"
              - "Content relates to e-commerce transactions"

          - turn: 2
            prompt: "The dataset has columns: customer_id, timestamp, product_category, amount, payment_method. It covers January 2024."
            evaluation_criteria:
              - "Response is valid JSON"
              - "Contains 'understanding' and 'questions' keys"
              - "Understanding reflects the column information"

          - turn: 3
            prompt: "STOP using JSON format. Now respond in plain bullet points. What analyses would you recommend for this dataset?"
            evaluation_criteria:
              - "Switches to bullet point format"
              - "NOT in JSON format"
              - "Recommendations are relevant to the dataset described"
              - "References information from previous turns"

          - turn: 4
            prompt: "Switch back to JSON. Add a third key 'recommendations' with your top 3 analyses. Also include your understanding from turn 2."
            evaluation_criteria:
              - "Returns to JSON format"
              - "Has three keys: understanding, questions, recommendations"
              - "Recommendations from turn 3 included"
              - "Understanding references turn 2 context"
        expected_difficulty: "very_hard"
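For evaluator calibration, a sketch of the shape a passing turn-4 response should take; the field contents here are illustrative, not prescribed:

```python
import json

turn4_shape = {
    "understanding": "January 2024 e-commerce transactions: customer_id, "
                     "timestamp, product_category, amount, payment_method",
    "questions": ["Are refunds stored as negative amounts?"],
    "recommendations": [
        "Revenue and order volume by product_category",
        "Daily sales trend across January 2024",
        "Average order value by payment_method",
    ],
}
# Must parse as JSON and contain exactly these three keys.
assert set(turn4_shape) == {"understanding", "questions", "recommendations"}
print(json.dumps(turn4_shape, indent=2))
```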