# AI Model Evaluation Framework

Comprehensive testing suite for evaluating AI models on general reasoning tasks and IT Forensics topics. Designed for comparing quantized and full-precision builds of a model (q4_K_M, q8_0, fp16) against academic and practical scenarios.

## Features

- **Comprehensive Test Coverage**
  - Logic & Reasoning
  - Mathematics & Calculations
  - Instruction Following
  - Creative Writing
  - Code Generation
  - Language Nuance
  - IT Forensics (MFT analysis, file signatures, registry, memory, network)
  - Multi-turn conversations with context retention

- **IT Forensics Focus**
  - Raw hex dump analysis (Master File Table)
  - File signature identification
  - Registry hive analysis
  - FILETIME conversions
  - Memory artifact extraction
  - TCP/IP header analysis
  - Timeline reconstruction

- **Automated Testing**
  - OpenAI-compatible API support (Ollama, LM Studio, etc.)
  - Interactive evaluation with scoring rubric
  - Progress tracking and auto-save
  - Multi-turn conversation handling

- **Analysis & Comparison**
  - Cross-model comparison reports
  - Category-wise performance breakdown
  - Difficulty-based analysis
  - CSV export for further analysis

## Quick Start

### Prerequisites

```bash
# Python 3.8+
pip install pyyaml requests
```

### Installation

```bash
# Clone or download the files
# Ensure these files are in your working directory:
# - ai_eval.py
# - analyze_results.py
# - test_suite.yaml
```

### Basic Usage

#### 1. Test a Single Model

```bash
# For Ollama (default: http://localhost:11434)
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M

# For other endpoints with an API key
python ai_eval.py \
  --endpoint https://api.example.com \
  --api-key sk-your-key-here \
  --model your-model-name
```

#### 2. Test Multiple Models (Quantization Comparison)

```bash
# Test different quantizations of qwen3:4b
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-q8_0
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:4b-fp16

# Test different model sizes
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:8b-q4_K_M
python ai_eval.py --endpoint http://localhost:11434 --model qwen3:14b-q4_K_M
```

#### 3. Filter by Category

```bash
# Test only IT Forensics categories
python ai_eval.py \
  --endpoint http://localhost:11434 \
  --model qwen3:4b \
  --category "IT Forensics - File Systems"
```

#### 4. Analyze Results

```bash
# Compare all tested models
python analyze_results.py --compare

# Detailed report for a specific model
python analyze_results.py --detail "qwen3:4b-q4_K_M"

# Export to CSV
python analyze_results.py --export comparison.csv
```

## Scoring Rubric

All tests are evaluated on a 0-5 scale:

| Score | Category | Description |
|-------|----------|-------------|
| 0-1 | **FAIL** | Major errors, fails to meet basic requirements |
| 2-3 | **PASS** | Meets requirements with minor issues |
| 4-5 | **EXCEPTIONAL** | Exceeds requirements, demonstrates deep understanding |
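
For reference, this is how the bands partition the scale. `score_band` is a hypothetical helper for illustration only, not a function from `ai_eval.py`:

```python
def score_band(score: int) -> str:
    """Map a 0-5 rubric score to its band (hypothetical helper, not part of ai_eval.py)."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    return "FAIL" if score <= 1 else "PASS" if score <= 3 else "EXCEPTIONAL"
```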

### Evaluation Criteria

#### Constraint Adherence

- Fail: Misses more than one constraint or uses a forbidden word
- Pass: Follows all constraints, but the flow is awkward
- Exceptional: Follows all constraints with natural, fluid language

#### Unit Precision (for math/forensics)

- Fail: Errors in basic conversion
- Pass: Correct conversions, but with rounding errors
- Exceptional: Perfect precision across systems

#### Reasoning Path

- Fail: Gives only the final answer, without steps
- Pass: Shows steps, but the logic contains "leaps"
- Exceptional: Transparent, logical chain-of-thought

#### Code Safety

- Fail: Function crashes on bad input
- Pass: Logic is correct but lacks error handling
- Exceptional: Production-ready, with robust error handling

## Test Categories Overview

### General Reasoning (14 tests)

- Logic puzzles & temporal reasoning
- Multi-step mathematics
- Strict instruction following
- Creative writing with constraints
- Code generation
- Language nuance understanding
- Problem-solving & logistics

### IT Forensics (8 tests)

#### File Systems

- **MFT Basic Analysis**: Signature, status flags, sequence numbers
- **MFT Advanced**: Update sequence arrays, LSN, attribute offsets
- **File Signatures**: Magic number identification (JPEG, PNG, PDF, ZIP, RAR); see the sketch below
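
To make the File Systems tests concrete, here is a minimal sketch of the kind of parsing they probe: matching the listed magic numbers and decoding the fixed header of an NTFS MFT `FILE` record. This is illustrative only, not code from `ai_eval.py`:

```python
import struct

# Well-known magic numbers for the formats covered by the File Signatures test
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF": "PDF",
    b"PK\x03\x04": "ZIP",
    b"Rar!\x1a\x07": "RAR",
}

def identify_signature(data: bytes) -> str:
    """Identify a file type from its leading magic bytes."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return "unknown"

def parse_mft_header(record: bytes) -> dict:
    """Decode the fixed header of an NTFS MFT FILE record (little-endian)."""
    if record[:4] != b"FILE":
        raise ValueError("not a FILE record")
    usa_offset, usa_count = struct.unpack_from("<HH", record, 0x04)
    lsn = struct.unpack_from("<Q", record, 0x08)[0]
    seq, hard_links, attr_offset, flags = struct.unpack_from("<HHHH", record, 0x10)
    return {
        "usa_offset": usa_offset,            # update sequence array offset
        "usa_count": usa_count,              # update sequence array entry count
        "lsn": lsn,                          # $LogFile sequence number
        "sequence_number": seq,
        "hard_links": hard_links,
        "first_attribute_offset": attr_offset,
        "in_use": bool(flags & 0x0001),      # status flag 0x01: record in use
        "is_directory": bool(flags & 0x0002) # status flag 0x02: directory
    }
```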

#### Registry & Artifacts

- **Registry Hive Headers**: Signature, sequence numbers, format version
- **FILETIME Conversion**: Windows timestamp decoding; see the sketch below
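
A Windows FILETIME counts 100-nanosecond intervals since 1601-01-01 UTC. A minimal decoder (illustrative, not part of the framework):

```python
from datetime import datetime, timedelta, timezone

WINDOWS_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a 64-bit FILETIME (100-ns ticks since 1601-01-01 UTC) to a datetime."""
    return WINDOWS_EPOCH + timedelta(microseconds=filetime // 10)

print(filetime_to_datetime(116444736000000000))  # 1970-01-01 00:00:00+00:00 (Unix epoch)
```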

#### Memory & Network

- **Memory Artifacts**: HTTP request extraction from dumps
- **TCP Headers**: Port, sequence, flags, window size analysis; see the sketch below
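
As an illustration of what the TCP Headers test asks for, a bare-bones parse of the 20-byte fixed TCP header (again, not framework code):

```python
import struct

def parse_tcp_header(raw: bytes) -> dict:
    """Unpack the 20-byte fixed portion of a TCP header (network byte order)."""
    src, dst, seq, ack, off_flags, window, checksum, urgent = struct.unpack("!HHIIHHHH", raw[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "sequence": seq,
        "acknowledgment": ack,
        "header_length": (off_flags >> 12) * 4,  # data offset in 32-bit words -> bytes
        "flags": off_flags & 0x01FF,             # NS/CWR/ECE/URG/ACK/PSH/RST/SYN/FIN bits
        "window_size": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }
```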

#### Timeline Analysis

- **Event Reconstruction**: Log correlation, attack narrative building

### Multi-turn Conversations (3 tests)

- Progressive hex analysis (PE file structure)
- Forensic investigation scenario
- Technical depth building (NTFS ADS)

## File Structure

```bash
.
├── ai_eval.py             # Main testing script
├── analyze_results.py     # Results analysis and comparison
├── test_suite.yaml        # Test definitions
├── results/               # Auto-created results directory
│   ├── qwen3_4b-q4_K_M_latest.json
│   ├── qwen3_4b-q8_0_latest.json
│   └── qwen3_4b-fp16_latest.json
└── README.md
```

## Advanced Usage

### Custom Test Suite

Edit `test_suite.yaml` to add your own tests:

```yaml
- category: "Your Category"
  tests:
    - id: "custom_01"
      name: "Your Test Name"
      type: "single_turn" # or "multi_turn"
      prompt: "Your test prompt here"
      evaluation_criteria:
        - "Criterion 1"
        - "Criterion 2"
      expected_difficulty: "medium" # medium, hard, very_hard
```
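
A quick way to sanity-check that your edited suite still parses, assuming the top-level layout shown above (a list of categories, each with a `tests` list); this snippet is illustrative and not part of the framework:

```python
import yaml

# Load the suite and count tests per the structure shown above
with open("test_suite.yaml") as f:
    suite = yaml.safe_load(f)

total = sum(len(category["tests"]) for category in suite)
print(f"{total} tests across {len(suite)} categories")
```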

### Batch Testing Script

Create `batch_test.sh`:

```bash
#!/bin/bash

ENDPOINT="http://localhost:11434"

# Test all qwen3:4b quantizations
for quant in q4_K_M q8_0 fp16; do
  echo "Testing qwen3:4b-${quant}..."
  python ai_eval.py --endpoint "$ENDPOINT" --model "qwen3:4b-${quant}"
done

# Test all sizes with q4_K_M
for size in 4b 8b 14b; do
  echo "Testing qwen3:${size}-q4_K_M..."
  python ai_eval.py --endpoint "$ENDPOINT" --model "qwen3:${size}-q4_K_M"
done

# Generate comparison
python analyze_results.py --compare
```

### Custom Endpoint Configuration

For OpenAI-compatible cloud services:

```bash
python ai_eval.py \
  --endpoint https://api.service.com \
  --api-key your-api-key \
  --model model-name
```
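
Any endpoint works as long as it speaks the OpenAI chat-completions protocol. A minimal request of that shape, for orientation only (the endpoint, key, and model below are placeholders, and this is not the actual code in `ai_eval.py`):

```python
import requests

ENDPOINT = "http://localhost:11434"  # placeholder: your OpenAI-compatible server
API_KEY = "sk-your-key-here"         # placeholder: local servers like Ollama ignore it

# POST a single-turn chat request to the OpenAI-compatible endpoint
response = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "qwen3:4b-q4_K_M",
        "messages": [
            {"role": "user", "content": "What is the magic number of a PNG file?"},
        ],
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```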