semeion
Overview
Semeion is a timeline-first digital forensics analysis platform that solves the two biggest pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.
Core Question Semeion Answers: "What happened around this time—and how are these events connected?"
Traditional forensic tools often have an unintuitive or slow user experience (an opinionated view). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, enabling investigators to see patterns at a glance rather than clicking through endless individual entries.
An investigator opens Semeion alongside their primary forensic tool, either searches for relevant content (like "discussions about data transfer" in Russian messages) or navigates to a known timestamp (from incident logs), and immediately sees all correlated activity—messages, file access, network connections, system events—organized into meaningful sequences rather than flat lists.
Key Features
| Feature | Description |
|---|---|
| Timeline Visualization | Render millions of events with <200ms response time; smooth zooming/panning |
| Semantic Event Clustering | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
| Multilingual Semantic Search | Find concepts across languages—search in English, find results in Russian/Chinese/Arabic with translation |
| Two-Mode Operation | Semantic content discovery → timeline analysis OR timestamp → content correlation |
| Cross-Source Correlation | Unified timeline across messages, files, network, browser, system events |
| Event Grouping | Collapse/expand semantically-related events; discover patterns |
| Native Desktop Application | PySide6-based—runs fully local, tailored for forensic-grade airgapped environments |
| Open Source & Transparent | Auditable methods, minimal "black box" AI, court-defensible results |
The Problem Semeion Solves
Problematic UX with existing forensic suites
Existing forensic tools have timeline interfaces that:
- Render slowly
- Show flat event lists with no context
- Require manual correlation across sources
- Have poor visualization and navigation
This assessment is highly opinionated and comes from the maintainer's personal experience with multiple large commercial suites that weren't able to deliver a satisfying result. I won't name any here, and exceptions may exist, but improvements are needed.
Semeion's Solution: Fast, interactive timeline with semantic clustering that shows patterns immediately.
Large Communication Datasets
Modern investigations involve:
- Seized databases with millions of relevant artifacts
- Multilingual content
- Slang, code words, and poor machine translation
- Hours spent reading irrelevant conversations
Semeion's Solution: Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.
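To make "search that understands meaning" more concrete, here is a minimal sketch of ranking messages by cosine similarity between vectors. Everything in it is an illustrative assumption: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity`, `rank_by_similarity`, and the sample messages are hypothetical names, not Semeion's API. In the planned stack, the nearest-neighbor lookup would be delegated to Qdrant rather than done in plain Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, items, top_k=2):
    """Return the top_k (score, text) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy vectors (hypothetical); a multilingual embedding model would place
# texts with similar meaning close together regardless of language.
messages = [
    ("Перешли мне архив сегодня вечером", [0.9, 0.1, 0.0]),  # "send me the archive tonight"
    ("What time is lunch?", [0.0, 0.2, 0.9]),
    ("Upload finished, deleting the originals", [0.8, 0.3, 0.1]),
]
# Stand-in for the embedding of "discussions about file transfers".
query = [1.0, 0.2, 0.0]

results = rank_by_similarity(query, messages)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The point of the sketch is only that the query and the messages never need to share words or even a language; ranking depends on vector direction alone.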
How It Works
Two Entry Points
Entry Point 1: Semantic content discovery → timeline analysis
1. Investigator searches: "discussions about file transfers"
2. Semeion finds semantically relevant messages
3. Click any result → Timeline centers on that timestamp
4. See all activity ±2 hours: file access, USB connections, deletions
5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
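Step 4 above (showing all activity within ±2 hours of a selected result) could be sketched as a range query against the SQLite metadata store. The table layout, column names, sample rows, and the `events_around` helper are assumptions made for illustration; the actual schema is not defined in this document.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT, source TEXT, description TEXT)"
)
# An index on the timestamp keeps range scans fast on large cases.
conn.execute("CREATE INDEX idx_events_ts ON events (ts)")

# Made-up sample data; all timestamps are ISO 8601 in UTC so that
# string comparison matches chronological order.
sample = [
    ("2024-03-15T11:05:00+00:00", "browser", "Visited file-sharing site"),
    ("2024-03-15T13:50:00+00:00", "message", "Chat mentions project_data.zip"),
    ("2024-03-15T14:33:15+00:00", "system", "USB device connected"),
    ("2024-03-15T14:33:30+00:00", "file", "File copied to USB: project_data.zip"),
    ("2024-03-15T18:00:00+00:00", "browser", "Visited news site"),
]
conn.executemany(
    "INSERT INTO events (ts, source, description) VALUES (?, ?, ?)", sample
)


def events_around(conn, center: datetime, window: timedelta = timedelta(hours=2)):
    """Return all events within +/- window of center, ordered by time."""
    start = (center - window).isoformat()
    end = (center + window).isoformat()
    cur = conn.execute(
        "SELECT ts, source, description FROM events "
        "WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (start, end),
    )
    return cur.fetchall()


# Center the window on the timestamp of the clicked search result.
center = datetime(2024, 3, 15, 13, 50, tzinfo=timezone.utc)
nearby = events_around(conn, center)
for ts, source, description in nearby:
    print(ts, source, description)
```

Storing timestamps as UTC-normalized ISO 8601 text is one simple option; integer epoch values would work equally well for this kind of range query.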
Entry Point 2: timestamp → content correlation
1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
2. Navigate timeline to that timestamp
3. Semantic clustering reveals:
- Communication spike 30min before (cluster: "Coordination Activity")
- Suspicious file downloads (cluster: "Malware Delivery")
- Network connections (cluster: "C2 Communication")
- Process executions leading to encryption
4. Complete attack chain visible at a glance
Semantic Event Clustering
Traditional Timeline View (Overwhelming):
14:33:15 - USB device connected
14:33:18 - Chrome visited facebook.com
14:33:22 - File modified: report.docx
14:33:30 - File copied to USB: project_data.zip (2.3 GB)
14:33:35 - File copied to USB: budget.xlsx
14:33:40 - File deleted: project_data.zip
14:33:45 - WhatsApp message sent
...50 more individual events...
Semeion Clustered View (Clear Pattern):
┌──────────────────────────────────────────────┐
│ 14:33:15-14:33:40 ⚠️ File Transfer (4) │ ← Click to expand
│ Pattern: Data Exfiltration Sequence │
│ USB connected → 2 files copied → source deleted │
│ Related: Files mentioned in chat 2min earlier │
└──────────────────────────────────────────────┘
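As a rough illustration of how the flat listing above might collapse into the four-event "File Transfer" cluster, the following sketch matches an ordered pattern template against the event stream. The event-type labels, the `USB_EXFILTRATION` template, and `match_template` are hypothetical; this is one simple way to do sequence matching, not necessarily the approach Semeion will use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: str    # HH:MM:SS, as in the listing above
    kind: str    # hypothetical event-type label
    detail: str


# Ordered steps of an assumed "USB exfiltration" template.
USB_EXFILTRATION = ("usb_connected", "file_copied_to_usb", "file_deleted")


def match_template(events, template):
    """Match template steps in order; a step may repeat and unrelated
    events in between are skipped. Returns the matched events, or an
    empty list if the template was not completed."""
    matched = []
    step = 0            # index of the template step we are currently in
    seen_step = False   # has the current step matched at least once?
    for event in events:
        if event.kind == template[step]:
            matched.append(event)
            seen_step = True
        elif seen_step and step + 1 < len(template) and event.kind == template[step + 1]:
            step += 1
            matched.append(event)
    complete = seen_step and step == len(template) - 1
    return matched if complete else []


events = [
    Event("14:33:15", "usb_connected", "USB device connected"),
    Event("14:33:18", "browser_visit", "Chrome visited facebook.com"),
    Event("14:33:22", "file_modified", "report.docx"),
    Event("14:33:30", "file_copied_to_usb", "project_data.zip (2.3 GB)"),
    Event("14:33:35", "file_copied_to_usb", "budget.xlsx"),
    Event("14:33:40", "file_deleted", "project_data.zip"),
    Event("14:33:45", "message_sent", "WhatsApp message sent"),
]

cluster = match_template(events, USB_EXFILTRATION)
print(f"{cluster[0].time}-{cluster[-1].time} File Transfer ({len(cluster)})")
```

The browser visit, the document edit, and the chat message stay outside the cluster because they match no step of the template.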
How Clustering Works
- Temporal Proximity: Events within configurable time window evaluated together
- Semantic Similarity: Vector embeddings detect conceptually-related events
- Entity Linking: Shared file names, IPs, usernames connect events
- Pattern Templates: Pre-defined sequences (USB exfiltration, login chains, etc.)
- LLM Summarization (Optional): Natural language cluster descriptions
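Two of the signals listed above, temporal proximity and entity linking, can be combined in a small sketch: events are first split wherever the time gap exceeds a configurable window, then merged within each window if they share an entity. The dictionary layout, the entity strings, the 60-second default, and all function names are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def group_by_time(events, max_gap=timedelta(seconds=60)):
    """Split time-sorted events into groups wherever the gap exceeds max_gap."""
    groups = []
    for event in sorted(events, key=lambda e: e["ts"]):
        if groups and event["ts"] - groups[-1][-1]["ts"] <= max_gap:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def link_by_entities(group):
    """Within one time group, merge events that share any entity."""
    clusters = []  # each item: {"entities": set, "events": list}
    for event in group:
        entities = set(event["entities"])
        touching = [c for c in clusters if c["entities"] & entities]
        clusters = [c for c in clusters if not (c["entities"] & entities)]
        merged = {"entities": entities, "events": [event]}
        for c in touching:
            merged["entities"] |= c["entities"]
            merged["events"].extend(c["events"])
        merged["events"].sort(key=lambda e: e["ts"])
        clusters.append(merged)
    return [c["events"] for c in clusters]


def cluster_events(events, max_gap=timedelta(seconds=60)):
    """Temporal grouping first, then entity linking inside each group."""
    result = []
    for group in group_by_time(events, max_gap):
        result.extend(link_by_entities(group))
    return result


def at(h, m, s):
    return datetime(2024, 3, 15, h, m, s)  # made-up date for the sample


events = [
    {"ts": at(14, 33, 15), "label": "USB device connected", "entities": ["usb0"]},
    {"ts": at(14, 33, 18), "label": "Chrome visited facebook.com", "entities": ["facebook.com"]},
    {"ts": at(14, 33, 30), "label": "File copied to USB", "entities": ["usb0", "project_data.zip"]},
    {"ts": at(14, 33, 40), "label": "File deleted", "entities": ["project_data.zip"]},
    {"ts": at(16, 0, 0), "label": "User login", "entities": ["user:alice"]},
]

clusters = cluster_events(events)
for c in clusters:
    print([e["label"] for e in c])
```

In this sample the USB connection, the copy, and the deletion end up together because the copy event shares one entity with each of the other two, while the browser visit and the later login remain on their own.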
Architecture
Desktop Application (PySide6)
┌─────────────────────────────────────────────────────────┐
│ PySide6 Desktop UI │
│ ┌───────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Timeline │ │ Semantic │ │ Artifact │ │
│ │ View │ │ Search │ │ Detail │ │
│ └───────────────┘ └──────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Analysis & Clustering Engine │
│ • Event clustering algorithm │
│ • Semantic similarity calculation │
│ • Cross-artifact correlation │
│ • Timeline rendering engine │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Data Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Qdrant │ │ SQLite │ │ LLM API │ │
│ │ (Vectors) │ │ (Metadata) │ │ (Optional) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
Deployment Options
| Configuration | Qdrant | Embedding | LLM | Use Case |
|---|---|---|---|---|
| Fully Local | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
| Airgapped Network | Internal server | Internal server | Internal server | Proper forensic environment |
| Hybrid | Local | Local | Cloud API (optional) | Investigative environment where confidentiality is not the main concern |
Artifact Types
Searchable & Timeline-Enabled
| Type | Examples | Features |
|---|---|---|
| Messages | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
| Browser | Chrome, Firefox, Safari, Edge | History, downloads, searches—timeline correlation |
| Email | Any email format (EML, MBOX, PST) | Searchable content, timeline placement |
| Documents | PDF, Word, plain text | Content search, access time correlation |
Timeline-Only (Context View)
| Type | Examples |
|---|---|
| File Events | File creation, modification, deletion, access |
| Process Events | Application launches, process creation |
| Network Events | Connections, DNS queries |
| System Events | Logs, authentication events, USB connections |
Technical Stack
| Component | Technology | Purpose |
|---|---|---|
| UI Framework | PySide6 (Qt6) | Native desktop application |
| Timeline Rendering | PyQtGraph or Plotly | High-performance visualization |
| Vector Database | Qdrant | Semantic search and similarity |
| Metadata Storage | SQLite | Fast local queries, metadata |
| LLM Interface | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
| Embedding | Sentence Transformers / OpenAI | Multilingual vector generation |
| Forensic Parsing | pytsk3, pyewf | Disk image processing |
| Language | Python 3.13+ | Application logic |
Data Model
SemeionArtifact (Universal Schema, WiP)
Every artifact—regardless of source—conforms to a unified model:
┌─────────────────────────────────────────────────────────┐
│ SemeionArtifact │
├─────────────────────────────────────────────────────────┤
│ Identity: id, case_id │
│ Classification: artifact_class, source_platform │
│ Temporal: timestamp (UTC normalized) │
│ Actors: [{identifier, display_name, role}] │
│ Content: text, semantic_text │
│ Entities: indexed_entities[] (files, IPs, etc) │
│ Hierarchy: parent_id, context_group │
│ Location: url, path, title │
│ Embeddings: semantic_vector (768-dim) │
│ Source-Specific: message{}, browser{}, email{}, etc │
└─────────────────────────────────────────────────────────┘
Cluster Model (WiP)
┌─────────────────────────────────────────────────────────┐
│ EventCluster │
├─────────────────────────────────────────────────────────┤
│ id: unique_cluster_id │
│ time_range: (start_timestamp, end_timestamp) │
│ events: [artifact_ids] │
│ pattern_type: "file_exfiltration" | "communication" │
│ confidence: 0.0-1.0 │
│ summary: "USB transfer with deletion" │
│ icon: "⚠️" | "💬" | "🌐" | etc │
│ semantic_links: [related_cluster_ids] │
└─────────────────────────────────────────────────────────┘
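The cluster model could be expressed in the same way. Again, this is a sketch under assumptions: field names come from the diagram, while the types, the validation in `__post_init__`, and the `contains` and `overlaps` helpers are hypothetical additions meant to show how a timeline view might query clusters.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EventCluster:
    id: str
    time_range: tuple[datetime, datetime]   # (start_timestamp, end_timestamp)
    events: list[str]                       # artifact ids
    pattern_type: str                       # e.g. "file_exfiltration"
    confidence: float                       # 0.0 to 1.0
    summary: str = ""
    icon: str = ""
    semantic_links: list[str] = field(default_factory=list)  # related cluster ids

    def __post_init__(self):
        start, end = self.time_range
        if start > end:
            raise ValueError("time_range start must not be after end")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")

    def contains(self, moment: datetime) -> bool:
        """True if moment falls inside the cluster's time range."""
        start, end = self.time_range
        return start <= moment <= end

    def overlaps(self, other: "EventCluster") -> bool:
        """True if the two clusters' time ranges intersect."""
        return (self.time_range[0] <= other.time_range[1]
                and other.time_range[0] <= self.time_range[1])


def utc(h, m, s):
    return datetime(2024, 3, 15, h, m, s, tzinfo=timezone.utc)  # made-up date


transfer = EventCluster(
    id="c-001",
    time_range=(utc(14, 33, 15), utc(14, 33, 40)),
    events=["a-101", "a-104", "a-105", "a-106"],
    pattern_type="file_exfiltration",
    confidence=0.85,
    summary="USB transfer with deletion",
)
chat = EventCluster(
    id="c-002",
    time_range=(utc(14, 31, 0), utc(14, 33, 20)),
    events=["a-090", "a-091"],
    pattern_type="communication",
    confidence=0.6,
    semantic_links=["c-001"],
)
print(transfer.contains(utc(14, 33, 30)), transfer.overlaps(chat))
```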
What Semeion Is NOT (yet)
| Not This | Why |
|---|---|
| Forensic suite replacement | Companion tool—use alongside Autopsy for acquisition |
| Reporting tool | Timeline export for reports, but documentation happens in primary suite |
| AI evidence interpreter | AI assists with search/clustering; investigator interprets evidence |
Development Setup
This project uses uv for dependency management.
Prerequisites
- Python 3.13+
- uv installed
- requirements.txt
Installation
git clone https://git.cc24.dev/mstoeck3/semeion
cd semeion
# Create virtual environment
uv venv --python 3.13
# Activate environment
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# Install dependencies
uv pip install -r requirements.txt -e .
# Configure environment
cp .env.example .env
# Edit .env with your Qdrant and embedding endpoint configurations
Running
semeion
System Requirements
Minimum (Remote Processing)
| Resource | Requirement |
|---|---|
| CPU | 4 cores |
| RAM | 8 GB |
| Storage | Minimal (evidence stored elsewhere) |
| GPU | Not required |
| Network | Access to Qdrant and embedding endpoints (if remote) |
Recommended (Local Processing)
| Resource | Requirement |
|---|---|
| CPU | 8+ cores |
| RAM | 16 GB (32 GB for large cases) |
| Storage | SSD, sufficient for evidence & vectors |
| GPU | Optional (improves embedding speed with local models) |
| Network | Optional (fully offline capable) |
Project Status
Current Phase: MVP Development - Timeline & Core Features
Roadmap:
- ✅ Concept and architecture design
- ⬜ Core infrastructure
- Unified artifact ingestion (WhatsApp, Chrome)
- SQLite + Qdrant integration
- ⬜ Timeline visualization
- High-performance rendering (target: <200ms for 100k events)
- Multi-source swim lanes
- Zoom/pan navigation
- ⬜ Semantic clustering
- Temporal proximity grouping
- Pattern template matching
- Semantic similarity detection
- ⬜ Communication search
- Multilingual embedding
- Query interpretation
- Search → timeline jump
- ⬜ Additional parsers (Telegram, Signal, etc.)
- ⬜ Export and reporting
- ⬜ Performance optimization & polish
Why Open Source Matters
Transparency for Court:
- Auditable algorithms—no "black box" analysis
- Reproducible results—scientific validation possible
- Peer-reviewed methods—community scrutiny
Accessibility:
- Free for budget-constrained labs
- No vendor lock-in
- Community-driven development
Innovation:
- Rapid feature development
- Specialized extensions possible
- Academic research enabled
Branding
Semeion derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark"—embodying the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.
The mascot, Koios (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.
License
BSD 3-Clause
Contributing
Semeion is in active development. Contributions welcome, especially from:
- Digital forensics practitioners (workflow validation)
- Timeline visualization experts
- Multilingual NLP specialists
- Performance optimization engineers
