From 096b9868a7cb834c1516c35484a75dc74d6eeb9c Mon Sep 17 00:00:00 2001 From: mstoeck3 Date: Wed, 17 Dec 2025 22:08:13 +0100 Subject: [PATCH] rework concept --- README.md | 389 +++++++++++++++++++++++++++++++++--------------------- 1 file changed, 236 insertions(+), 153 deletions(-) diff --git a/README.md b/README.md index fb622da..91606bb 100644 --- a/README.md +++ b/README.md @@ -4,189 +4,231 @@ ## Overview -Semeion is a **semantic search companion tool** for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest. +Semeion is a **timeline-first digital forensics analysis platform** that solves the two biggest pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events. -> **Core Question Semeion Answers:** "Where should I look?" +> **Core Question Semeion Answers:** "What happened around this time—and how are these events connected?" -An investigator opens Semeion alongside their primary forensic tool, types a query like *"discussions about cryptocurrency payments"*, reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite. +Traditional forensic tools often have an unintuitive or slow user experience (opinionated). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically-related events, enabling investigators to see patterns at a glance rather than clicking through endless individual entries. 
-([Intro Video](https://cloud.cc24.dev/s/pNwNWJE9QkDiX3J)) +An investigator opens Semeion alongside their primary forensic tool, either searches for relevant content (like "*discussions about data transfer*" in Russian messages) or navigates to a known timestamp (from incident logs), and immediately sees all correlated activity—messages, file access, network connections, system events—organized into meaningful sequences rather than flat lists. ## Key Features | Feature | Description | |---------|-------------| -| **Natural Language Search** | Query artifacts using plain English instead of keywords | -| **LLM-Assisted Interpretation** | Queries are parsed by an LLM into structured search parameters | -| **Human-in-the-Loop Confirmation** | Review and edit interpreted search parameters before execution | -| **Semantic + (Hybrid Search)** | Combines meaning-based vector search with keyword matching | -| **Interactive Refinement** | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API | -| **Temporal Context View** | See what happened before and after a discovered artifact | -| **Universal Artifact Model** | Platform-agnostic — can adapt to any forensic data from external sources, expandable concept | -| **Flexible Deployment** | Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing) | +| **Timeline Visualization** | Render millions of events with <200ms response time; smooth zooming/panning | +| **Semantic Event Clustering** | Automatically group related events into patterns (file transfers, communication bursts, login sequences) | +| **Multilingual Semantic Search** | Find concepts across languages—search in English, find results in Russian/Chinese/Arabic with translation | +| **Two-Mode Operation** | Semantic content discovery → timeline analysis OR timestamp → content correlation | +| **Cross-Source Correlation** | Unified timeline across messages, files, network, browser, 
system events | +| **Event Grouping** | Collapse/expand semantically-related events; discover patterns | +| **Native Desktop Application** | PySide6-based—runs fully local, tailored for forensic-grade airgapped environments | +| **Open Source & Transparent** | Auditable methods, minimize "blackbox" AI, court-defensible results | -## Artifact Types (for first PoC) +## The Problem Semeion Solves -### Searchable (Semantic Vector) +### Problematic UX with existing forensic suites -These artifacts are embedded and searchable via natural language: +Existing forensic tools have timeline interfaces that: -| Type | Examples | -|------|----------| -| **Messages** | WhatsApp, Telegram, Signal, or anything which has a parseable SQLite database | -| **Browser Events** | Chrome, Firefox, Safari, Edge — history, downloads, bookmarks, searches | -| **Email** | Any email files (simplified: sender, receiver, subject, body, timestamp) | -| **Documents** | PDF, Word, plain text | +- Render slowly +- Show flat event lists with no context +- Require manual correlation across sources +- Have poor visualization and navigation +*This is a strongly opinionated assessment based on the maintainer's personal experience with multiple large commercial suites that weren't able to deliver a satisfying result. I won't name any here, and exceptions may exist, but improvement is needed.* -### Timeline-Only (Context View) +**Semeion's Solution:** Fast, interactive timeline with semantic clustering that shows patterns immediately. 
-These artifacts appear in the temporal context view but are not semantically searchable: +### Large Communication Datasets -| Type | Examples | -|------|----------| -| **File Events** | File creation, modification, deletion, access (via Sleuthkit) | -| **Process Events** | Application launches, process creation | -| **Network Events** | Connections, DNS queries | -| **Registry Events** | Windows registry modifications | -| **System Events** | Logs, authentication events | +Modern investigations involve: + +- Seized databases with millions of relevant artifacts +- Multilingual content +- Slang, code words, and poor machine translation +- Hours spent reading irrelevant conversations + +**Semeion's Solution:** Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context. ## How It Works -![resources/workflow.png](resources/workflow.png) +### Two Entry Points + +#### **Entry Point 1: Semantic content discovery → timeline analysis** + +```bash +1. Investigator searches: "discussions about file transfers" +2. Semeion finds semantically relevant messages +3. Click any result → Timeline centers on that timestamp +4. See all activity ±2 hours: file access, USB connections, deletions +5. Semantic clustering highlights: "File Exfiltration Sequence" pattern +``` + +#### **Entry Point 2: timestamp → content correlation** + +```bash +1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00" +2. Navigate timeline to that timestamp +3. Semantic clustering reveals: + - Communication spike 30min before (cluster: "Coordination Activity") + - Suspicious file downloads (cluster: "Malware Delivery") + - Network connections (cluster: "C2 Communication") + - Process executions leading to encryption +4. 
Complete attack chain visible at a glance +``` + +## Semantic Event Clustering + +**Traditional Timeline View (Overwhelming):** + +```bash +14:33:15 - USB device connected +14:33:18 - Chrome visited facebook.com +14:33:22 - File modified: report.docx +14:33:30 - File copied to USB: project_data.zip (2.3 GB) +14:33:35 - File copied to USB: budget.xlsx +14:33:40 - File deleted: project_data.zip +14:33:45 - WhatsApp message sent +...50 more individual events... +``` + +**Semeion Clustered View (Clear Pattern):** + +```bash +┌──────────────────────────────────────────────┐ +│ 14:33:15-14:33:40 ⚠️ File Transfer (4) │ ← Click to expand +│ Pattern: Data Exfiltration Sequence │ +│ USB connected → 2 files copied → source deleted │ +│ Related: Files mentioned in chat 2min earlier │ +└──────────────────────────────────────────────┘ +``` + +### How Clustering Works + +1. **Temporal Proximity**: Events within configurable time window evaluated together +2. **Semantic Similarity**: Vector embeddings detect conceptually-related events +3. **Entity Linking**: Shared file names, IPs, usernames connect events +4. **Pattern Templates**: Pre-defined sequences (USB exfiltration, login chains, etc.) +5. 
**LLM Summarization** (Optional): Natural language cluster descriptions ## Architecture -### Client-Server Design +### Desktop Application (PySide6) ```bash -┌─────────────────────────────────────────────────────────────────────────┐ -│ │ -│ ┌─────────────────────┐ │ -│ │ PySide6 Client │ │ -│ │ (Investigator │ │ -│ │ Workstation) │ │ -│ └──────────┬──────────┘ │ -│ │ │ -│ ┌───────────────┴───────────────┐ │ -│ │ │ │ -│ ▼ ▼ │ -│ ┌─────────────────────┐ ┌──────────────────────┐ │ -│ │ Qdrant API │ │ LLM API │ │ -│ │ │ │ (OpenAI-compatible)│ │ -│ └─────────────────────┘ └──────────────────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────────────────┘ +┌─────────────────────────────────────────────────────────┐ +│ PySide6 Desktop UI │ +│ ┌───────────────┐ ┌──────────────┐ ┌─────────────┐ │ +│ │ Timeline │ │ Semantic │ │ Artifact │ │ +│ │ View │ │ Search │ │ Detail │ │ +│ └───────────────┘ └──────────────┘ └─────────────┘ │ +└─────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────┐ +│ Analysis & Clustering Engine │ +│ • Event clustering algorithm │ +│ • Semantic similarity calculation │ +│ • Cross-artifact correlation │ +│ • Timeline rendering engine │ +└─────────────────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────────────────┐ +│ Data Layer │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Qdrant │ │ SQLite │ │ LLM API │ │ +│ │ (Vectors) │ │ (Metadata) │ │ (Optional) │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +└─────────────────────────────────────────────────────────┘ ``` ### Deployment Options -| Configuration | Qdrant | LLM | Use Case | -|---------------|--------|-----|----------| -| **Fully Local** | localhost | Ollama (localhost) | Single investigator, offline | -| **Airgapped Network** | Internal server | Internal server | Forensic lab, sensitive cases | -| **Hybrid** | Local | Cloud API | 
Balance of privacy and capability | -| **Full Cloud** | Cloud | Cloud API | Team access, scalability | +| Configuration | Qdrant | Embedding | LLM | Use Case | +|---------------|--------|-----------|-----|----------| +| **Fully Local** | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline | +| **Airgapped Network** | Internal server | Internal server | Internal server | proper forensic environment | +| **Hybrid** | Local | Local | Cloud API (optional) | investigative environment without focus on confidentiality | -## Use Case Examples +## Artifact Types -### Example 1: Cryptocurrency Investigation +### Searchable & Timeline-Enabled -**Query:** *"Find discussions about buying software with cryptocurrency"* +| Type | Examples | Features | +|------|----------|----------| +| **Messages** | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering | +| **Browser** | Chrome, Firefox, Safari, Edge | History, downloads, searches—timeline correlation | +| **Email** | Any email format (EML, MBOX, PST) | Searchable content, timeline placement | +| **Documents** | PDF, Word, plain text | Content search, access time correlation | -**Semeion finds:** +### Timeline-Only (Context View) -- Chat messages mentioning crypto payments -- Browser visits to exchange websites -- Wallet-related searches - -**Context view reveals:** - -- File downloads after payment discussions -- Application installations -- Network connections to blockchain services - -### Example 2: Data Exfiltration - -**Query:** *"Messages about sending confidential documents"* - -**Semeion finds:** - -- Emails discussing document sharing -- Chat messages about file transfers - -**Context view reveals:** - -- File access events before discussions -- Cloud storage uploads after discussions -- USB device connections - -### Example 3: Timeline Reconstruction - -**Query:** *"Threatening messages received in March"* - -**Semeion finds:** - -- Messages matching 
threatening language patterns - -**Context view reveals:** - -- What the recipient searched afterward -- Files accessed or deleted -- Communication with others about the threat +| Type | Examples | +|------|----------| +| **File Events** | File creation, modification, deletion, access | +| **Process Events** | Application launches, process creation | +| **Network Events** | Connections, DNS queries | +| **System Events** | Logs, authentication events, USB connections | ## Technical Stack | Component | Technology | Purpose | |-----------|------------|---------| -| GUI Framework | PySide6 | Desktop application | -| Vector Database | Qdrant | Semantic search and storage | -| LLM Interface | OpenAI-compatible API | Query interpretation | -| Embedding API | OpenAI-compatible API | Vector generation | -| Forensic Parsing | pytsk3, pyewf | Disk image processing | -| Language | Python 3.13 (PySide6-restricted) | Application logic | +| **UI Framework** | PySide6 (Qt6) | Native desktop application | +| **Timeline Rendering** | PyQtGraph or Plotly | High-performance visualization | +| **Vector Database** | Qdrant | Semantic search and similarity | +| **Metadata Storage** | SQLite | Fast local queries, metadata | +| **LLM Interface** | OpenAI-compatible API | Optional: cluster summarization, query interpretation | +| **Embedding** | Sentence Transformers / OpenAI | Multilingual vector generation | +| **Forensic Parsing** | pytsk3, pyewf | Disk image processing | +| **Language** | Python 3.13+ | Application logic | ## Data Model -### SemeionArtifact (Simplified) +### SemeionArtifact (Universal Schema, WiP) -Every artifact — regardless of source platform — conforms to a universal schema: +Every artifact—regardless of source—conforms to a unified model: ```bash -┌─────────────────────────────────────────────────────────────────────────┐ -│ SemeionArtifact │ -├─────────────────────────────────────────────────────────────────────────┤ -│ │ -│ Identity: id, case_id │ -│ 
Classification: artifact_class, source_platform, searchable │ -│ Temporal: timestamp │ -│ Actors: [{identifier, display_name, role}] │ -│ Content: text, semantic_text │ -│ Entities: indexed_entities[] (for filtering) │ -│ Hierarchy: parent_id, chunk_info (for documents) │ -│ Context: context_group (conversation, thread, session) │ -│ Location: url, path, title │ -│ Source-Specific: message{}, browser{}, email{}, document{}, etc. │ -│ Ingestion: ingested_at, source_file, parser_id │ -│ │ -└─────────────────────────────────────────────────────────────────────────┘ +┌─────────────────────────────────────────────────────────┐ +│ SemeionArtifact │ +├─────────────────────────────────────────────────────────┤ +│ Identity: id, case_id │ +│ Classification: artifact_class, source_platform │ +│ Temporal: timestamp (UTC normalized) │ +│ Actors: [{identifier, display_name, role}] │ +│ Content: text, semantic_text │ +│ Entities: indexed_entities[] (files, IPs, etc) │ +│ Hierarchy: parent_id, context_group │ +│ Location: url, path, title │ +│ Embeddings: semantic_vector (768-dim) │ +│ Source-Specific: message{}, browser{}, email{}, etc │ +└─────────────────────────────────────────────────────────┘ ``` -### Vector Strategy +### Cluster Model (WiP) -| Vector | Purpose | Required | -|--------|---------|----------| -| **Semantic** | Conceptual similarity search | Yes | -| **Sparse** (keywords) | Exact term matching (hybrid) | Optional | - -Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views. 
+```bash +┌─────────────────────────────────────────────────────────┐ +│ EventCluster │ +├─────────────────────────────────────────────────────────┤ +│ id: unique_cluster_id │ +│ time_range: (start_timestamp, end_timestamp) │ +│ events: [artifact_ids] │ +│ pattern_type: "file_exfiltration" | "communication" │ +│ confidence: 0.0-1.0 │ +│ summary: "USB transfer with deletion" │ +│ icon: "⚠️" | "💬" | "🌐" | etc │ +│ semantic_links: [related_cluster_ids] │ +└─────────────────────────────────────────────────────────┘ +``` ## What Semeion Is NOT (yet) | Not This | Why | |----------|-----| -| Forensic suite replacement | Companion tool — use alongside Autopsy/Axiom | -| Reporting Tool | Review and analyse findings, documents in primary application | -| Forensic Interpretation Robot | Helps you to discover what you otherwise wouldnt | +| **Forensic suite replacement** | Companion tool—use alongside Autopsy for acquisition | +| **Reporting tool** | Timeline export for reports, but documentation happens in primary suite | +| **AI evidence interpreter** | AI assists with search/clustering; investigator interprets evidence | ## Development Setup @@ -196,11 +238,12 @@ This project uses [uv](https://github.com/astral-sh/uv) for dependency managemen - Python 3.13+ - [uv](https://github.com/astral-sh/uv) installed +- requirements.txt ### Installation ```bash -git clone +git clone https://git.cc24.dev/mstoeck3/semeion cd semeion # Create virtual environment @@ -215,13 +258,11 @@ uv pip install -r requirements.txt -e . 
# Configure environment cp .env.example .env -# Edit .env with your Qdrant and LLM endpoint configurations +# Edit .env with your Qdrant and embedding endpoint configurations ``` ### Running -uv handles the startup script: - ```bash semeion ``` @@ -232,41 +273,83 @@ semeion | Resource | Requirement | |----------|-------------| -| CPU | Multi-core | -| RAM | 4 GB | +| CPU | 4 cores | +| RAM | 8 GB | | Storage | Minimal (evidence stored elsewhere) | -| Network | Access to Qdrant and LLM endpoints | +| GPU | Not required | +| Network | Access to Qdrant and embedding endpoints (if remote) | ### Recommended (Local Processing) | Resource | Requirement | |----------|-------------| | CPU | 8+ cores | -| RAM | 32 GB | -| Storage | sufficient for evidence & vectors, LLM if installed locally | -| GPU | optional (improves embedding speed) | +| RAM | 16 GB (32 GB for large cases) | +| Storage | SSD, sufficient for evidence & vectors | +| GPU | Optional (improves embedding speed with local models) | +| Network | Optional (fully offline capable) | ## Project Status -**Current Phase:** Architecture and data model definition +**Current Phase:** MVP Development - Timeline & Core Features **Roadmap:** -1. ✅ Concept and schema design -2. ⬜ Core infrastructure (Qdrant collection, basic ingestion) -3. ⬜ Search execution (semantic search, filtering) -4. ⬜ LLM integration (query interpretation) -5. ⬜ Refinement system (Qdrant Recommend) -6. ⬜ Context view -7. ⬜ Platform parsers (WhatsApp, Chrome, etc.) -8. ⬜ Hybrid search (sparse vectors) +1. ✅ Concept and architecture design +2. ⬜ Core infrastructure + - Unified artifact ingestion (WhatsApp, Chrome) + - SQLite + Qdrant integration +3. ⬜ Timeline visualization + - High-performance rendering (target: <200ms for 100k events) + - Multi-source swim lanes + - Zoom/pan navigation +4. ⬜ Semantic clustering + - Temporal proximity grouping + - Pattern template matching + - Semantic similarity detection +5. 
⬜ Communication search + - Multilingual embedding + - Query interpretation + - Search → timeline jump +6. ⬜ Additional parsers (Telegram, Signal, etc.) +7. ⬜ Export and reporting +8. ⬜ Performance optimization & polish + +## Why Open Source Matters + +**Transparency for Court:** + +- Auditable algorithms—no "black box" analysis +- Reproducible results—scientific validation possible +- Peer-reviewed methods—community scrutiny + +**Accessibility:** + +- Free for budget-constrained labs +- No vendor lock-in +- Community-driven development + +**Innovation:** + +- Rapid feature development +- Specialized extensions possible +- Academic research enabled ## Branding -**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark" — embodying the software's mission of interpreting semantic signals within digital evidence. +**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark"—embodying the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis. -The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought. +The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought. ## License BSD 3-Clause + +## Contributing + +Semeion is in active development. Contributions welcome, especially from: + +- Digital forensics practitioners (workflow validation) +- Timeline visualization experts +- Multilingual NLP specialists +- Performance optimization engineers
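
The "How Clustering Works" list in the README changes above names temporal proximity and entity linking as two of the grouping signals. The following is a minimal sketch of how those two steps might combine, not the project's actual implementation. `Artifact`, `EventCluster`, `cluster_events`, the 30-second window, and the entity labels are all hypothetical names and values chosen for illustration; the sample events are taken from the "Traditional Timeline View" example, with an arbitrary date.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Artifact:
    """Hypothetical, heavily reduced stand-in for SemeionArtifact."""
    id: str
    timestamp: datetime
    text: str
    entities: frozenset[str] = frozenset()


@dataclass
class EventCluster:
    """Hypothetical, heavily reduced stand-in for EventCluster."""
    events: list[Artifact] = field(default_factory=list)

    @property
    def time_range(self) -> tuple[datetime, datetime]:
        # Events are appended in chronological order, so first/last suffice.
        return (self.events[0].timestamp, self.events[-1].timestamp)

    @property
    def entities(self) -> frozenset[str]:
        # Union of all entities seen in this cluster so far.
        return frozenset().union(*(e.entities for e in self.events))


def cluster_events(artifacts, window=timedelta(seconds=30)):
    """Join an event to the first cluster that is close in time
    (temporal proximity) and shares an entity (entity linking)."""
    clusters: list[EventCluster] = []
    for art in sorted(artifacts, key=lambda a: a.timestamp):
        for cluster in clusters:
            close = art.timestamp - cluster.events[-1].timestamp <= window
            if close and cluster.entities & art.entities:
                cluster.events.append(art)
                break
        else:
            # No matching cluster: this event starts a new one.
            clusters.append(EventCluster(events=[art]))
    return clusters


# Sample data; the date is arbitrary, the entity labels are assumptions.
base = datetime(2024, 1, 1, 14, 33, 0)


def at(sec: int) -> datetime:
    return base + timedelta(seconds=sec)


events = [
    Artifact("usb-connect", at(15), "USB device connected", frozenset({"usb"})),
    Artifact("chrome-visit", at(18), "Chrome visited facebook.com",
             frozenset({"facebook.com"})),
    Artifact("file-mod", at(22), "File modified: report.docx",
             frozenset({"report.docx"})),
    Artifact("copy-zip", at(30), "File copied to USB: project_data.zip",
             frozenset({"usb", "project_data.zip"})),
    Artifact("copy-xlsx", at(35), "File copied to USB: budget.xlsx",
             frozenset({"usb", "budget.xlsx"})),
    Artifact("delete-zip", at(40), "File deleted: project_data.zip",
             frozenset({"project_data.zip"})),
]

clusters = cluster_events(events)
for c in clusters:
    start, end = c.time_range
    print(f"{start:%H:%M:%S}-{end:%H:%M:%S}", [e.id for e in c.events])
```

In this sketch the USB connection, both copies, and the deletion end up in one cluster because each shares an entity with an earlier member, while the browser visit and the unrelated file edit stay separate. Semantic similarity, pattern templates, and LLM summarization from the same list are deliberately left out.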