# semeion

![Semeion title image](resources/title_image.png)

## Overview

Semeion is a **timeline-first digital forensics analysis platform** that addresses the two biggest pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.

> **Core Question Semeion Answers:** "What happened around this time, and how are these events connected?"

Traditional forensic tools often have an unintuitive or slow user experience (an opinionated view). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, so investigators can see patterns at a glance rather than clicking through endless individual entries.

An investigator opens Semeion alongside their primary forensic tool, then either searches for relevant content (like "*discussions about data transfer*" in Russian messages) or navigates to a known timestamp (from incident logs). They immediately see all correlated activity (messages, file access, network connections, system events) organized into meaningful sequences rather than flat lists.

## Key Features

| Feature | Description |
|---------|-------------|
| **Timeline Visualization** | Render millions of events with <200 ms response time (design target); smooth zooming/panning |
| **Semantic Event Clustering** | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
| **Multilingual Semantic Search** | Find concepts across languages: search in English, find results in Russian/Chinese/Arabic with translation |
| **Two-Mode Operation** | Semantic content discovery → timeline analysis OR timestamp → content correlation |
| **Cross-Source Correlation** | Unified timeline across messages, files, network, browser, system events |
| **Event Grouping** | Collapse/expand semantically related events; discover patterns |
| **Native Desktop Application** | PySide6-based; runs fully locally, tailored for forensic-grade airgapped environments |
| **Open Source & Transparent** | Auditable methods, minimal "black box" AI, court-defensible results |

## The Problem Semeion Solves

### Problematic UX with existing forensic suites

Existing forensic tools have timeline interfaces that:

- Render slowly
- Show flat event lists with no context
- Require manual correlation across sources
- Have poor visualization and navigation

*This is a highly opinionated view, based on the maintainer's personal experience with multiple large commercial suites that weren't able to deliver a satisfying result. I won't name any here, and exceptions may exist, but improvements are needed.*

**Semeion's Solution:** Fast, interactive timeline with semantic clustering that shows patterns immediately.

### Large Communication Datasets

Modern investigations involve:

- Seized databases with millions of relevant artifacts
- Multilingual content
- Slang, code words, and poor machine translation
- Hours spent reading irrelevant conversations

**Semeion's Solution:** Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.

## How It Works

### Two Entry Points

#### **Entry Point 1: Semantic content discovery → timeline analysis**

```text
1. Investigator searches: "discussions about file transfers"
2. Semeion finds semantically relevant messages
3. Click any result → Timeline centers on that timestamp
4. See all activity ±2 hours: file access, USB connections, deletions
5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
```

#### **Entry Point 2: Timestamp → content correlation**

```text
1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
2. Navigate timeline to that timestamp
3. Semantic clustering reveals:
   - Communication spike 30 min before (cluster: "Coordination Activity")
   - Suspicious file downloads (cluster: "Malware Delivery")
   - Network connections (cluster: "C2 Communication")
   - Process executions leading to encryption
4. Complete attack chain visible at a glance
```

## Semantic Event Clustering

**Traditional Timeline View (Overwhelming):**

```text
14:33:15 - USB device connected
14:33:18 - Chrome visited facebook.com
14:33:22 - File modified: report.docx
14:33:30 - File copied to USB: project_data.zip (2.3 GB)
14:33:35 - File copied to USB: budget.xlsx
14:33:40 - File deleted: project_data.zip
14:33:45 - WhatsApp message sent
...50 more individual events...
```

**Semeion Clustered View (Clear Pattern):**

```text
┌──────────────────────────────────────────────────┐
│ 14:33:15-14:33:40  ⚠️ File Transfer (4)          │  ← Click to expand
│ Pattern: Data Exfiltration Sequence              │
│ USB connected → 2 files copied → source deleted  │
│ Related: Files mentioned in chat 2 min earlier   │
└──────────────────────────────────────────────────┘
```

### How Clustering Works

1. **Temporal Proximity**: Events within a configurable time window are evaluated together
2. **Semantic Similarity**: Vector embeddings detect conceptually related events
3. **Entity Linking**: Shared file names, IPs, and usernames connect events
4. **Pattern Templates**: Pre-defined sequences (USB exfiltration, login chains, etc.)
5. **LLM Summarization** (Optional): Natural language cluster descriptions

## Architecture

### Desktop Application (PySide6)

```text
┌─────────────────────────────────────────────────────────┐
│                   PySide6 Desktop UI                    │
│  ┌───────────────┐  ┌──────────────┐  ┌─────────────┐   │
│  │   Timeline    │  │   Semantic   │  │  Artifact   │   │
│  │     View      │  │    Search    │  │   Detail    │   │
│  └───────────────┘  └──────────────┘  └─────────────┘   │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│              Analysis & Clustering Engine               │
│  • Event clustering algorithm                           │
│  • Semantic similarity calculation                      │
│  • Cross-artifact correlation                           │
│  • Timeline rendering engine                            │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│                       Data Layer                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │    Qdrant    │  │    SQLite    │  │   LLM API    │   │
│  │  (Vectors)   │  │  (Metadata)  │  │  (Optional)  │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
```

### Deployment Options

| Configuration | Qdrant | Embedding | LLM | Use Case |
|---------------|--------|-----------|-----|----------|
| **Fully Local** | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
| **Airgapped Network** | Internal server | Internal server | Internal server | Proper forensic environment |
| **Hybrid** | Local | Local | Cloud API (optional) | Investigative environment where confidentiality is not the focus |

## Artifact Types

### Searchable & Timeline-Enabled

| Type | Examples | Features |
|------|----------|----------|
| **Messages** | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
| **Browser** | Chrome, Firefox, Safari, Edge | History, downloads, searches; timeline correlation |
| **Email** | Common email formats (EML, MBOX, PST) | Searchable content, timeline placement |
| **Documents** | PDF, Word, plain text | Content search, access time correlation |

### Timeline-Only (Context View)

| Type | Examples |
|------|----------|
| **File Events** | File creation, modification, deletion, access |
| **Process Events** | Application launches, process creation |
| **Network Events** | Connections, DNS queries |
| **System Events** | Logs, authentication events, USB connections |

## Technical Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **UI Framework** | PySide6 (Qt6) | Native desktop application |
| **Timeline Rendering** | PyQtGraph or Plotly | High-performance visualization |
| **Vector Database** | Qdrant | Semantic search and similarity |
| **Metadata Storage** | SQLite | Fast local queries, metadata |
| **LLM Interface** | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
| **Embedding** | Sentence Transformers / OpenAI | Multilingual vector generation |
| **Forensic Parsing** | pytsk3, pyewf | Disk image processing |
| **Language** | Python 3.13+ | Application logic |

## Data Model

### SemeionArtifact (Universal Schema, WiP)

Every artifact, regardless of source, conforms to a unified model:

```text
┌─────────────────────────────────────────────────────────┐
│                     SemeionArtifact                     │
├─────────────────────────────────────────────────────────┤
│ Identity:        id, case_id                            │
│ Classification:  artifact_class, source_platform        │
│ Temporal:        timestamp (UTC normalized)             │
│ Actors:          [{identifier, display_name, role}]     │
│ Content:         text, semantic_text                    │
│ Entities:        indexed_entities[] (files, IPs, etc.)  │
│ Hierarchy:       parent_id, context_group               │
│ Location:        url, path, title                       │
│ Embeddings:      semantic_vector (768-dim)              │
│ Source-Specific: message{}, browser{}, email{}, etc.    │
└─────────────────────────────────────────────────────────┘
```

### Cluster Model (WiP)

```text
┌─────────────────────────────────────────────────────────┐
│                      EventCluster                       │
├─────────────────────────────────────────────────────────┤
│ id:             unique_cluster_id                       │
│ time_range:     (start_timestamp, end_timestamp)        │
│ events:         [artifact_ids]                          │
│ pattern_type:   "file_exfiltration" | "communication"   │
│ confidence:     0.0-1.0                                 │
│ summary:        "USB transfer with deletion"            │
│ icon:           "⚠️" | "💬" | "🌐" | etc.               │
│ semantic_links: [related_cluster_ids]                   │
└─────────────────────────────────────────────────────────┘
```

## What Semeion Is NOT (yet)

| Not This | Why |
|----------|-----|
| **Forensic suite replacement** | Companion tool; use alongside Autopsy for acquisition |
| **Reporting tool** | Timelines can be exported for reports, but documentation happens in the primary suite |
| **AI evidence interpreter** | AI assists with search/clustering; the investigator interprets evidence |

## Development Setup

This project uses [uv](https://github.com/astral-sh/uv) for dependency management.

### Prerequisites

- Python 3.13+
- [uv](https://github.com/astral-sh/uv) installed
- `requirements.txt` (used by the install step below)

### Installation

```bash
git clone https://git.cc24.dev/mstoeck3/semeion
cd semeion

# Create virtual environment
uv venv --python 3.13

# Activate environment
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows

# Install dependencies
uv pip install -r requirements.txt -e .

# Configure environment
cp .env.example .env
# Edit .env with your Qdrant and embedding endpoint configurations
```

### Running

```bash
semeion
```

## System Requirements

### Minimum (Remote Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | 4 cores |
| RAM | 8 GB |
| Storage | Minimal (evidence stored elsewhere) |
| GPU | Not required |
| Network | Access to Qdrant and embedding endpoints (if remote) |

### Recommended (Local Processing)

| Resource | Requirement |
|----------|-------------|
| CPU | 8+ cores |
| RAM | 16 GB (32 GB for large cases) |
| Storage | SSD, sufficient for evidence & vectors |
| GPU | Optional (improves embedding speed with local models) |
| Network | Optional (fully offline capable) |

## Project Status

**Current Phase:** MVP Development - Timeline & Core Features

**Roadmap:**

1. ✅ Concept and architecture design
2. ⬜ Core infrastructure
   - Unified artifact ingestion (WhatsApp, Chrome)
   - SQLite + Qdrant integration
3. ⬜ Timeline visualization
   - High-performance rendering (target: <200 ms for 100k events)
   - Multi-source swim lanes
   - Zoom/pan navigation
4. ⬜ Semantic clustering
   - Temporal proximity grouping
   - Pattern template matching
   - Semantic similarity detection
5. ⬜ Communication search
   - Multilingual embedding
   - Query interpretation
   - Search → timeline jump
6. ⬜ Additional parsers (Telegram, Signal, etc.)
7. ⬜ Export and reporting
8. ⬜ Performance optimization & polish

## Why Open Source Matters

**Transparency for Court:**

- Auditable algorithms: no "black box" analysis
- Reproducible results: scientific validation possible
- Peer-reviewed methods: community scrutiny

**Accessibility:**

- Free for budget-constrained labs
- No vendor lock-in
- Community-driven development

**Innovation:**

- Rapid feature development
- Specialized extensions possible
- Academic research enabled

## Branding

**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark." The name reflects the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.

The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.

## License

BSD 3-Clause

## Contributing

Semeion is in active development. Contributions are welcome, especially from:

- Digital forensics practitioners (workflow validation)
- Timeline visualization experts
- Multilingual NLP specialists
- Performance optimization engineers
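
## Appendix: Clustering Sketch

As a starting point for contributors, the following is a minimal sketch of three of the steps listed under "How Clustering Works" (temporal proximity, entity linking, and a pattern template), applied to the example events from the "Traditional Timeline View". It is an assumption-laden illustration, not the project's algorithm: the `Event` and `Cluster` classes, the `kind` labels, the 60-second window, the greedy grouping strategy, and the placeholder date are all hypothetical. Semantic similarity and LLM summarization are left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    kind: str            # hypothetical event type label
    entities: frozenset  # names shared across events: devices, files, hosts


@dataclass
class Cluster:
    events: list = field(default_factory=list)
    entities: set = field(default_factory=set)


def cluster_events(events, window=timedelta(seconds=60)):
    """Greedy grouping: an event joins the most recent cluster that is
    close in time (temporal proximity) and shares at least one entity
    (entity linking); otherwise it starts a new cluster."""
    clusters = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        target = None
        for candidate in reversed(clusters):
            close = ev.timestamp - candidate.events[-1].timestamp <= window
            if close and candidate.entities & ev.entities:
                target = candidate
                break
        if target is None:
            target = Cluster()
            clusters.append(target)
        target.events.append(ev)
        target.entities |= ev.entities
    return clusters


def matches_usb_exfiltration(cluster):
    """Toy pattern template: USB connect, at least one copy, then a delete."""
    kinds = [e.kind for e in cluster.events]
    return (
        len(kinds) >= 3
        and kinds[0] == "usb_connect"
        and "file_copy" in kinds
        and kinds[-1] == "file_delete"
    )


def at(second):
    return datetime(2024, 1, 1, 14, 33, second)  # placeholder date


events = [
    Event(at(15), "usb_connect", frozenset({"usb"})),
    Event(at(18), "browser_visit", frozenset({"facebook.com"})),
    Event(at(22), "file_modify", frozenset({"report.docx"})),
    Event(at(30), "file_copy", frozenset({"usb", "project_data.zip"})),
    Event(at(35), "file_copy", frozenset({"usb", "budget.xlsx"})),
    Event(at(40), "file_delete", frozenset({"project_data.zip"})),
    Event(at(45), "message_sent", frozenset({"whatsapp"})),
]

clusters = cluster_events(events)
largest = max(clusters, key=lambda c: len(c.events))
print(len(largest.events), "events,", "exfiltration pattern:", matches_usb_exfiltration(largest))
```

With these inputs the USB connection, both copies, and the deletion end up in one cluster, while the unrelated browser, document, and messaging events stay separate. A production version would also need a confidence score and the semantic similarity step, which this sketch does not attempt.
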