# semeion
![Semeion title image](resources/title_image.png)
## Overview
Semeion is a **timeline-first digital forensics analysis platform** that addresses two major pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.
> **Core Question Semeion Answers:** "What happened around this time—and how are these events connected?"
Traditional forensic tools often have an unintuitive or slow user experience (an opinionated view). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, so investigators can see patterns at a glance rather than clicking through endless individual entries.
An investigator opens Semeion alongside their primary forensic tool, then either searches for relevant content (like "*discussions about data transfer*" in Russian messages) or navigates to a known timestamp (from incident logs). Semeion then shows all correlated activity (messages, file access, network connections, system events) organized into meaningful sequences rather than flat lists.
## Key Features
| Feature | Description |
|---------|-------------|
| **Timeline Visualization** | Designed to handle millions of events with a <200ms response-time target (current roadmap goal: 100k events); smooth zooming/panning |
| **Semantic Event Clustering** | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
| **Multilingual Semantic Search** | Find concepts across languages—search in English, find results in Russian/Chinese/Arabic with translation |
| **Two-Mode Operation** | Semantic content discovery → timeline analysis OR timestamp → content correlation |
| **Cross-Source Correlation** | Unified timeline across messages, files, network, browser, system events |
| **Event Grouping** | Collapse/expand semantically-related events; discover patterns |
| **Native Desktop Application** | PySide6-based—runs fully local, tailored for forensic-grade airgapped environments |
| **Open Source & Transparent** | Auditable methods, minimal "black box" AI, court-defensible results |
## The Problem Semeion Solves
### Problematic UX in Existing Forensic Suites
Existing forensic tools have timeline interfaces that:
- Render slowly
- Show flat event lists with no context
- Require manual correlation across sources
- Have poor visualization and navigation
*This is a highly opinionated view based on the maintainer's personal experience with multiple large commercial suites that weren't able to deliver a satisfying result. I won't name any here, and exceptions may exist, but improvement is needed.*
**Semeion's Solution:** Fast, interactive timeline with semantic clustering that shows patterns immediately.
### Large Communication Datasets
Modern investigations involve:
- Seized databases with millions of relevant artifacts
- Multilingual content
- Slang, code words, and poor machine translation
- Hours spent reading irrelevant conversations
**Semeion's Solution:** Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.
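The idea behind embedding-based search can be sketched in a few lines. This is a minimal illustration only, not Semeion's implementation: the three-dimensional vectors are made-up stand-ins for real multilingual embeddings, the document IDs are hypothetical, and a real deployment would delegate the ranking to the vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, indexed, top_k=2):
    """Return the top_k (score, doc_id) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in indexed.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy vectors (assumption): messages about the same concept land close
# together regardless of the language they were written in.
indexed_messages = {
    "msg_ru_transfer": [0.8, 0.2, 0.1],
    "msg_en_lunch": [0.0, 0.1, 0.9],
    "msg_zh_upload": [0.7, 0.0, 0.3],
}
query = [0.9, 0.1, 0.0]  # stand-in for the embedding of an English query

results = rank(query, indexed_messages)
for score, doc_id in results:
    print(f"{doc_id}: {score:.3f}")
```

Because ranking depends only on vector proximity, the query language does not need to match the message language as long as the embedding model maps both into the same space.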
## How It Works
### Two Entry Points
#### **Entry Point 1: Semantic content discovery → timeline analysis**
```text
1. Investigator searches: "discussions about file transfers"
2. Semeion finds semantically relevant messages
3. Click any result → Timeline centers on that timestamp
4. See all activity ±2 hours: file access, USB connections, deletions
5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
```
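Step 4 above boils down to a time-window query against the metadata store. The following is a minimal sketch under stated assumptions: the `events` table, its columns, and the sample rows are hypothetical and only illustrate the ±2 hour lookup; timestamps are stored as Unix seconds in UTC.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def events_around(conn, center, window=timedelta(hours=2)):
    """Return (ts, source, description) rows within center ± window, oldest first."""
    start = int((center - window).timestamp())
    end = int((center + window).timestamp())
    return conn.execute(
        "SELECT ts, source, description FROM events "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (start, end),
    ).fetchall()


def utc(hour, minute):
    """Helper for the sample data: a UTC datetime on an arbitrary example day."""
    return datetime(2024, 3, 15, hour, minute, tzinfo=timezone.utc)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, source TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (int(utc(12, 0).timestamp()), "chrome", "visited news site"),
        (int(utc(13, 50).timestamp()), "whatsapp", "message sent"),
        (int(utc(14, 33).timestamp()), "usb", "device connected"),
        (int(utc(14, 35).timestamp()), "filesystem", "file copied to USB"),
        (int(utc(18, 0).timestamp()), "chrome", "visited webmail"),
    ],
)

hit_time = utc(14, 22)  # timestamp of the clicked search result
rows = events_around(conn, hit_time)
for ts, source, description in rows:
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(), source, description)
```

Storing timestamps as integers keeps the range comparison independent of string formatting; an index on `ts` would keep this lookup fast on large cases.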
#### **Entry Point 2: timestamp → content correlation**
```text
1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
2. Navigate timeline to that timestamp
3. Semantic clustering reveals:
   - Communication spike 30min before (cluster: "Coordination Activity")
   - Suspicious file downloads (cluster: "Malware Delivery")
   - Network connections (cluster: "C2 Communication")
   - Process executions leading to encryption
4. Complete attack chain visible at a glance
```
## Semantic Event Clustering
**Traditional Timeline View (Overwhelming):**
```text
14:33:15 - USB device connected
14:33:18 - Chrome visited facebook.com
14:33:22 - File modified: report.docx
14:33:30 - File copied to USB: project_data.zip (2.3 GB)
14:33:35 - File copied to USB: budget.xlsx
14:33:40 - File deleted: project_data.zip
14:33:45 - WhatsApp message sent
...50 more individual events...
```
**Semeion Clustered View (Clear Pattern):**
```text
┌──────────────────────────────────────────────┐
│ 14:33:15-14:33:40 ⚠️ File Transfer (4) │ ← Click to expand
│ Pattern: Data Exfiltration Sequence │
│ USB connected → 2 files copied → source deleted │
│ Related: Files mentioned in chat 2min earlier │
└──────────────────────────────────────────────┘
```
### How Clustering Works
1. **Temporal Proximity**: Events within a configurable time window are evaluated together
2. **Semantic Similarity**: Vector embeddings detect conceptually-related events
3. **Entity Linking**: Shared file names, IPs, usernames connect events
4. **Pattern Templates**: Pre-defined sequences (USB exfiltration, login chains, etc.)
5. **LLM Summarization** (Optional): Natural language cluster descriptions
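
Steps 1, 3, and 4 can be sketched with the standard library alone. This is an illustrative sketch, not the project's algorithm: the `Event` class, the event-kind labels, the 30-second gap, and the template tuple are all assumptions chosen to mirror the USB example above, and the date is arbitrary.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    kind: str         # hypothetical event-type label
    entity: str = ""  # e.g. a file name shared between events


def group_by_gap(events, max_gap=timedelta(seconds=30)):
    """Temporal proximity: start a new group whenever the gap to the previous event exceeds max_gap."""
    groups = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if groups and ev.timestamp - groups[-1][-1].timestamp <= max_gap:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups


def linked_entities(group):
    """Entity linking: map each entity that appears in more than one event to the kinds of those events."""
    by_entity = defaultdict(list)
    for ev in group:
        if ev.entity:
            by_entity[ev.entity].append(ev.kind)
    return {name: kinds for name, kinds in by_entity.items() if len(kinds) > 1}


def matches_template(group, template):
    """Pattern template: True if the template kinds occur in order (unrelated events may be interleaved)."""
    kinds = iter(ev.kind for ev in group)
    return all(step in kinds for step in template)


USB_EXFILTRATION = ("usb_connected", "file_copied_to_usb", "file_deleted")


def at(h, m, s):
    return datetime(2024, 3, 15, h, m, s)


events = [
    Event(at(14, 33, 15), "usb_connected"),
    Event(at(14, 33, 18), "browser_visit"),
    Event(at(14, 33, 22), "file_modified", "report.docx"),
    Event(at(14, 33, 30), "file_copied_to_usb", "project_data.zip"),
    Event(at(14, 33, 35), "file_copied_to_usb", "budget.xlsx"),
    Event(at(14, 33, 40), "file_deleted", "project_data.zip"),
    Event(at(14, 33, 45), "message_sent"),
    Event(at(15, 10, 0), "browser_visit"),
]

groups = group_by_gap(events)
for group in groups:
    print(len(group), matches_template(group, USB_EXFILTRATION), linked_entities(group))
```

Semantic similarity (step 2) and LLM summarization (step 5) are omitted here because they depend on an embedding model and an LLM endpoint.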
## Architecture
### Desktop Application (PySide6)
```text
┌─────────────────────────────────────────────────────────┐
│ PySide6 Desktop UI │
│ ┌───────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Timeline │ │ Semantic │ │ Artifact │ │
│ │ View │ │ Search │ │ Detail │ │
│ └───────────────┘ └──────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Analysis & Clustering Engine │
│ • Event clustering algorithm │
│ • Semantic similarity calculation │
│ • Cross-artifact correlation │
│ • Timeline rendering engine │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Data Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Qdrant │ │ SQLite │ │ LLM API │ │
│ │ (Vectors) │ │ (Metadata) │ │ (Optional) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
```
### Deployment Options
| Configuration | Qdrant | Embedding | LLM | Use Case |
|---------------|--------|-----------|-----|----------|
| **Fully Local** | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
| **Airgapped Network** | Internal server | Internal server | Internal server | Proper forensic environment |
| **Hybrid** | Local | Local | Cloud API (optional) | Investigative environment where confidentiality is not the primary concern |
## Artifact Types
### Searchable & Timeline-Enabled
| Type | Examples | Features |
|------|----------|----------|
| **Messages** | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
| **Browser** | Chrome, Firefox, Safari, Edge | History, downloads, searches—timeline correlation |
| **Email** | Any email format (EML, MBOX, PST) | Searchable content, timeline placement |
| **Documents** | PDF, Word, plain text | Content search, access time correlation |
### Timeline-Only (Context View)
| Type | Examples |
|------|----------|
| **File Events** | File creation, modification, deletion, access |
| **Process Events** | Application launches, process creation |
| **Network Events** | Connections, DNS queries |
| **System Events** | Logs, authentication events, USB connections |
## Technical Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| **UI Framework** | PySide6 (Qt6) | Native desktop application |
| **Timeline Rendering** | PyQtGraph or Plotly | High-performance visualization |
| **Vector Database** | Qdrant | Semantic search and similarity |
| **Metadata Storage** | SQLite | Fast local queries, metadata |
| **LLM Interface** | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
| **Embedding** | Sentence Transformers / OpenAI | Multilingual vector generation |
| **Forensic Parsing** | pytsk3, pyewf | Disk image processing |
| **Language** | Python 3.13+ | Application logic |
## Data Model
### SemeionArtifact (Universal Schema, WiP)
Every artifact—regardless of source—conforms to a unified model:
```text
┌─────────────────────────────────────────────────────────┐
│ SemeionArtifact │
├─────────────────────────────────────────────────────────┤
│ Identity: id, case_id │
│ Classification: artifact_class, source_platform │
│ Temporal: timestamp (UTC normalized) │
│ Actors: [{identifier, display_name, role}] │
│ Content: text, semantic_text │
│ Entities: indexed_entities[] (files, IPs, etc) │
│ Hierarchy: parent_id, context_group │
│ Location: url, path, title │
│ Embeddings: semantic_vector (768-dim) │
│ Source-Specific: message{}, browser{}, email{}, etc │
└─────────────────────────────────────────────────────────┘
```
### Cluster Model (WiP)
```text
┌─────────────────────────────────────────────────────────┐
│ EventCluster │
├─────────────────────────────────────────────────────────┤
│ id: unique_cluster_id │
│ time_range: (start_timestamp, end_timestamp) │
│ events: [artifact_ids] │
│ pattern_type: "file_exfiltration" | "communication" │
│ confidence: 0.0-1.0 │
│ summary: "USB transfer with deletion" │
│ icon: "⚠️" | "💬" | "🌐" | etc │
│ semantic_links: [related_cluster_ids] │
└─────────────────────────────────────────────────────────┘
```
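A possible Python shape for this model is sketched below. As with the artifact schema, this is an assumption-laden illustration: the field names follow the diagram, while the types, the validation in `__post_init__`, and the `contains` helper are hypothetical additions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EventCluster:
    id: str
    time_range: tuple[datetime, datetime]
    events: list[str]  # artifact IDs belonging to this cluster
    pattern_type: str  # e.g. "file_exfiltration" or "communication"
    confidence: float
    summary: str = ""
    icon: str = ""
    semantic_links: list[str] = field(default_factory=list)

    def __post_init__(self):
        start, end = self.time_range
        if start > end:
            raise ValueError("time_range start must not be after its end")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")

    def contains(self, moment):
        """True if moment falls inside the cluster's time range (inclusive)."""
        start, end = self.time_range
        return start <= moment <= end


# Values mirror the clustered-view example; the date and IDs are arbitrary.
start = datetime(2024, 3, 15, 14, 33, 15, tzinfo=timezone.utc)
end = datetime(2024, 3, 15, 14, 33, 40, tzinfo=timezone.utc)
cluster = EventCluster(
    id="cluster-001",
    time_range=(start, end),
    events=["a-101", "a-104", "a-105", "a-106"],
    pattern_type="file_exfiltration",
    confidence=0.9,
    summary="USB transfer with deletion",
)
print(cluster.contains(datetime(2024, 3, 15, 14, 33, 30, tzinfo=timezone.utc)))
```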
## What Semeion Is NOT (yet)
| Not This | Why |
|----------|-----|
| **Forensic suite replacement** | Companion tool—use alongside Autopsy for acquisition |
| **Reporting tool** | Timeline export for reports, but documentation happens in primary suite |
| **AI evidence interpreter** | AI assists with search/clustering; investigator interprets evidence |
## Development Setup
This project uses [uv](https://github.com/astral-sh/uv) for dependency management.
### Prerequisites
- Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed
- Dependencies listed in `requirements.txt` (installed during the Installation steps)
### Installation
```bash
git clone https://git.cc24.dev/mstoeck3/semeion
cd semeion
# Create virtual environment
uv venv --python 3.13
# Activate environment
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install dependencies
uv pip install -r requirements.txt -e .
# Configure environment
cp .env.example .env
# Edit .env with your Qdrant and embedding endpoint configurations
```
### Running
```bash
semeion
```
## System Requirements
### Minimum (Remote Processing)
| Resource | Requirement |
|----------|-------------|
| CPU | 4 cores |
| RAM | 8 GB |
| Storage | Minimal (evidence stored elsewhere) |
| GPU | Not required |
| Network | Access to Qdrant and embedding endpoints (if remote) |
### Recommended (Local Processing)
| Resource | Requirement |
|----------|-------------|
| CPU | 8+ cores |
| RAM | 16 GB (32 GB for large cases) |
| Storage | SSD, sufficient for evidence & vectors |
| GPU | Optional (improves embedding speed with local models) |
| Network | Optional (fully offline capable) |
## Project Status
**Current Phase:** MVP Development - Timeline & Core Features
**Roadmap:**
1. ✅ Concept and architecture design
2. ⬜ Core infrastructure
   - Unified artifact ingestion (WhatsApp, Chrome)
   - SQLite + Qdrant integration
3. ⬜ Timeline visualization
   - High-performance rendering (target: <200ms for 100k events)
   - Multi-source swim lanes
   - Zoom/pan navigation
4. ⬜ Semantic clustering
   - Temporal proximity grouping
   - Pattern template matching
   - Semantic similarity detection
5. ⬜ Communication search
   - Multilingual embedding
   - Query interpretation
   - Search → timeline jump
6. ⬜ Additional parsers (Telegram, Signal, etc.)
7. ⬜ Export and reporting
8. ⬜ Performance optimization & polish
## Why Open Source Matters
**Transparency for Court:**
- Auditable algorithms—no "black box" analysis
- Reproducible results—scientific validation possible
- Peer-reviewed methods—community scrutiny
**Accessibility:**
- Free for budget-constrained labs
- No vendor lock-in
- Community-driven development
**Innovation:**
- Rapid feature development
- Specialized extensions possible
- Academic research enabled
## Branding
**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark"—embodying the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.
The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.
## License
BSD 3-Clause
## Contributing
Semeion is in active development. Contributions welcome, especially from:
- Digital forensics practitioners (workflow validation)
- Timeline visualization experts
- Multilingual NLP specialists
- Performance optimization engineers