rework concept

2025-12-17 22:08:13 +01:00
parent ea460864a6
commit 096b9868a7
1 changed files with 236 additions and 153 deletions
--- a/README.md
+++ b/README.md
@@ -4,189 +4,231 @@
 ## Overview
-Semeion is a **semantic search companion tool** for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.
+Semeion is a **timeline-first digital forensics analysis platform** that solves the two biggest pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.
-> **Core Question Semeion Answers:** "Where should I look?"
+> **Core Question Semeion Answers:** "What happened around this time—and how are these events connected?"
-An investigator opens Semeion alongside their primary forensic tool, types a query like *"discussions about cryptocurrency payments"*, reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.
+Traditional forensic tools often have an unintuitive or slow user experience (opinionated). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, enabling investigators to see patterns at a glance rather than clicking through endless individual entries.
-([Intro Video](https://cloud.cc24.dev/s/pNwNWJE9QkDiX3J))
+An investigator opens Semeion alongside their primary forensic tool, then either searches for relevant content (like "*discussions about data transfer*" in Russian messages) or navigates to a known timestamp (from incident logs). Either way, they immediately see all correlated activity (messages, file access, network connections, system events) organized into meaningful sequences rather than flat lists.
 ## Key Features
 | Feature | Description |
 |---------|-------------|
-| **Natural Language Search** | Query artifacts using plain English instead of keywords |
+| **Timeline Visualization** | Render millions of events with <200ms response time; smooth zooming/panning |
-| **LLM-Assisted Interpretation** | Queries are parsed by an LLM into structured search parameters |
+| **Semantic Event Clustering** | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
-| **Human-in-the-Loop Confirmation** | Review and edit interpreted search parameters before execution |
+| **Multilingual Semantic Search** | Find concepts across languages—search in English, find results in Russian/Chinese/Arabic with translation |
-| **Semantic + (Hybrid Search)** | Combines meaning-based vector search with keyword matching |
+| **Two-Mode Operation** | Semantic content discovery → timeline analysis OR timestamp → content correlation |
-| **Interactive Refinement** | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API |
+| **Cross-Source Correlation** | Unified timeline across messages, files, network, browser, system events |
-| **Temporal Context View** | See what happened before and after a discovered artifact |
+| **Event Grouping** | Collapse/expand semantically-related events; discover patterns |
-| **Universal Artifact Model** | Platform-agnostic — can adapt to any forensic data from external sources, expandable concept |
+| **Native Desktop Application** | PySide6-based—runs fully local, tailored for forensic-grade airgapped environments |
-| **Flexible Deployment** | Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing) |
+| **Open Source & Transparent** | Auditable methods, minimal "black box" AI, court-defensible results |
-## Artifact Types (for first PoC)
+## The Problem Semeion Solves
-### Searchable (Semantic Vector)
+### Problematic UX with existing forensic suites
-These artifacts are embedded and searchable via natural language:
+Existing forensic tools have timeline interfaces that:
-| Type | Examples |
+- Render slowly
-|------|----------|
+- Show flat event lists with no context
-| **Messages** | WhatsApp, Telegram, Signal, or anything which has a parseable SQLite database |
+- Require manual correlation across sources
-| **Browser Events** | Chrome, Firefox, Safari, Edge — history, downloads, bookmarks, searches |
+- Have poor visualization and navigation
-| **Email** | Any email files (simplified: sender, receiver, subject, body, timestamp) |
+*This is a highly opinionated view, based on the maintainer's personal experience with multiple large commercial suites that weren't able to deliver a satisfying result. I won't name any here, and exceptions may exist, but improvements are needed.*
 | **Documents** | PDF, Word, plain text |
-### Timeline-Only (Context View)
+**Semeion's Solution:** Fast, interactive timeline with semantic clustering that shows patterns immediately.
-These artifacts appear in the temporal context view but are not semantically searchable:
+### Large Communication Datasets
-| Type | Examples |
+Modern investigations involve:
-|------|----------|
+
-| **File Events** | File creation, modification, deletion, access (via Sleuthkit) |
+- Seized databases with millions of relevant artifacts
-| **Process Events** | Application launches, process creation |
+- Multilingual content
-| **Network Events** | Connections, DNS queries |
+- Slang, code words, and poor machine translation
-| **Registry Events** | Windows registry modifications |
+- Hours spent reading irrelevant conversations
-| **System Events** | Logs, authentication events |
+
 **Semeion's Solution:** Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.
 ## How It Works
-![resources/workflow.png](resources/workflow.png)
+### Two Entry Points
 #### **Entry Point 1: Semantic content discovery → timeline analysis**
 ```bash
 1. Investigator searches: "discussions about file transfers"
 2. Semeion finds semantically relevant messages
 3. Click any result → Timeline centers on that timestamp
 4. See all activity ±2 hours: file access, USB connections, deletions
 5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
 ```
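Step 4 above (showing all activity within ±2 hours of a selected result) can be sketched as a range query against the SQLite metadata store. This is a minimal sketch under assumptions: the `artifacts` table, its column names, and the storage of timestamps as UTC epoch seconds are hypothetical and not taken from Semeion's actual schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def events_around(conn, center, window=timedelta(hours=2)):
    """Return (timestamp, artifact_class, text) rows within +/- window of center."""
    start = int((center - window).timestamp())
    end = int((center + window).timestamp())
    cur = conn.execute(
        "SELECT timestamp, artifact_class, text FROM artifacts "
        "WHERE timestamp BETWEEN ? AND ? ORDER BY timestamp",
        (start, end),
    )
    return cur.fetchall()


def demo():
    # Hypothetical in-memory table standing in for the metadata store.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE artifacts (timestamp INTEGER, artifact_class TEXT, text TEXT)"
    )
    center = datetime(2024, 3, 15, 14, 22, tzinfo=timezone.utc)
    rows = [
        (center - timedelta(hours=3), "message", "outside the window"),
        (center - timedelta(minutes=30), "message", "communication spike"),
        (center + timedelta(minutes=5), "file", "file modified"),
    ]
    conn.executemany(
        "INSERT INTO artifacts VALUES (?, ?, ?)",
        [(int(ts.timestamp()), cls, text) for ts, cls, text in rows],
    )
    return events_around(conn, center)
```

An index on the timestamp column would likely be needed for this query to stay fast on large cases; whether that alone meets the response-time target is untested here.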
 #### **Entry Point 2: timestamp → content correlation**
 ```bash
 1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
 2. Navigate timeline to that timestamp
 3. Semantic clustering reveals:
   - Communication spike 30min before (cluster: "Coordination Activity")
   - Suspicious file downloads (cluster: "Malware Delivery")
   - Network connections (cluster: "C2 Communication")
   - Process executions leading to encryption
 4. Complete attack chain visible at a glance
 ```
 ## Semantic Event Clustering
 **Traditional Timeline View (Overwhelming):**
 ```bash
 14:33:15 - USB device connected
 14:33:18 - Chrome visited facebook.com
 14:33:22 - File modified: report.docx
 14:33:30 - File copied to USB: project_data.zip (2.3 GB)
 14:33:35 - File copied to USB: budget.xlsx
 14:33:40 - File deleted: project_data.zip
 14:33:45 - WhatsApp message sent
 ...50 more individual events...
 ```
 **Semeion Clustered View (Clear Pattern):**
 ```bash
 ┌──────────────────────────────────────────────┐
 │ 14:33:15-14:33:40  ⚠️ File Transfer (4)      │ ← Click to expand
 │ Pattern: Data Exfiltration Sequence          │
 │ USB connected → 2 files copied → source deleted │
 │ Related: Files mentioned in chat 2min earlier │
 └──────────────────────────────────────────────┘
 ```
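The collapse from the flat list into the four-event "File Transfer" cluster above could be driven by a pattern template. The following is a minimal sketch, not Semeion's implementation: the `Event` class, the event kind labels, the five-minute span, and the date used in `demo()` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    kind: str  # hypothetical labels, e.g. "usb_connected", "file_copied"
    target: str = ""


def match_usb_exfiltration(events, max_span=timedelta(minutes=5)):
    """Return events forming: USB connect -> file copies -> deletion of a copied file.

    Returns an empty list if no such sequence completes within max_span.
    """
    events = sorted(events, key=lambda e: e.timestamp)
    for i, first in enumerate(events):
        if first.kind != "usb_connected":
            continue
        matched, copied = [first], set()
        for ev in events[i + 1:]:
            if ev.timestamp - first.timestamp > max_span:
                break
            if ev.kind == "file_copied":
                matched.append(ev)
                copied.add(ev.target)
            elif ev.kind == "file_deleted" and ev.target in copied:
                matched.append(ev)
                return matched
    return []


def demo():
    # The date is arbitrary; the times mirror the traditional timeline view.
    def t(clock):
        return datetime.fromisoformat(f"2024-03-15T{clock}")

    timeline = [
        Event(t("14:33:15"), "usb_connected"),
        Event(t("14:33:18"), "browser_visit", "facebook.com"),
        Event(t("14:33:22"), "file_modified", "report.docx"),
        Event(t("14:33:30"), "file_copied", "project_data.zip"),
        Event(t("14:33:35"), "file_copied", "budget.xlsx"),
        Event(t("14:33:40"), "file_deleted", "project_data.zip"),
        Event(t("14:33:45"), "message_sent"),
    ]
    return match_usb_exfiltration(timeline)
```

Unrelated events inside the window (the browser visit, the document edit) are skipped rather than breaking the match, which is what lets the cluster stay at four events.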
 ### How Clustering Works
 1. **Temporal Proximity**: Events within configurable time window evaluated together
 2. **Semantic Similarity**: Vector embeddings detect conceptually-related events
 3. **Entity Linking**: Shared file names, IPs, usernames connect events
 4. **Pattern Templates**: Pre-defined sequences (USB exfiltration, login chains, etc.)
 5. **LLM Summarization** (Optional): Natural language cluster descriptions
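
Step 1 (temporal proximity) can be illustrated with simple gap-based grouping: events stay in the same group as long as each one follows the previous within a configurable gap. This is one possible reading of "configurable time window", offered as a sketch; the function name and the 60-second default are assumptions, and the later steps (semantic similarity, entity linking, templates) are not covered here.

```python
from datetime import datetime, timedelta


def group_by_proximity(timestamps, max_gap=timedelta(seconds=60)):
    """Split timestamps into groups where consecutive items are at most max_gap apart."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= max_gap:
            groups[-1].append(ts)  # close enough to the previous event
        else:
            groups.append([ts])  # gap too large: start a new group
    return groups


def demo():
    base = datetime(2024, 3, 15, 14, 33, 15)  # arbitrary illustrative date
    offsets = [0, 15, 25, 405, 425]  # seconds after base
    return group_by_proximity([base + timedelta(seconds=s) for s in offsets])
```

The resulting groups would then be candidates for the similarity and entity-linking steps, which decide whether events in a group actually belong together.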
 ## Architecture
-### Client-Server Design
+### Desktop Application (PySide6)
 ```bash
-┌─────────────────────────────────────────────────────────────────────────┐
+┌─────────────────────────────────────────────────────────┐
-│                                                                         │
+│                   PySide6 Desktop UI                    │
-│                      ┌─────────────────────┐                            │
+│  ┌───────────────┐  ┌──────────────┐  ┌─────────────┐  │
-│                      │   PySide6 Client    │                            │
+│  │   Timeline    │  │   Semantic   │  │   Artifact  │  │
-│                      │   (Investigator     │                            │
+│  │     View      │  │    Search    │  │    Detail   │  │
-│                      │    Workstation)     │                            │
+│  └───────────────┘  └──────────────┘  └─────────────┘  │
-│                      └──────────┬──────────┘                            │
+└─────────────────────────────────────────────────────────┘
-│                                 │                                       │
+                          ↓
-│                 ┌───────────────┴───────────────┐                       │
+┌─────────────────────────────────────────────────────────┐
-│                 │                               │                       │
+│              Analysis & Clustering Engine               │
-│                 ▼                               ▼                       │
+│  • Event clustering algorithm                           │
-│      ┌─────────────────────┐       ┌──────────────────────┐             │
+│  • Semantic similarity calculation                      │
-│      │    Qdrant API       │       │   LLM API            │             │
+│  • Cross-artifact correlation                           │
-│      │                     │       │   (OpenAI-compatible)│             │
+│  • Timeline rendering engine                            │
-│      └─────────────────────┘       └──────────────────────┘             │
+└─────────────────────────────────────────────────────────┘
-│                                                                         │
+                          ↓
-└─────────────────────────────────────────────────────────────────────────┘
+┌─────────────────────────────────────────────────────────┐
 │                   Data Layer                            │
 │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
 │  │   Qdrant     │  │   SQLite     │  │   LLM API    │  │
 │  │   (Vectors)  │  │   (Metadata) │  │   (Optional) │  │
 │  └──────────────┘  └──────────────┘  └──────────────┘  │
 └─────────────────────────────────────────────────────────┘
 ```
 ### Deployment Options
-| Configuration | Qdrant | LLM | Use Case |
+| Configuration | Qdrant | Embedding | LLM | Use Case |
-|---------------|--------|-----|----------|
+|---------------|--------|-----------|-----|----------|
-| **Fully Local** | localhost | Ollama (localhost) | Single investigator, offline |
+| **Fully Local** | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
-| **Airgapped Network** | Internal server | Internal server | Forensic lab, sensitive cases |
+| **Airgapped Network** | Internal server | Internal server | Internal server | Proper forensic environment |
-| **Hybrid** | Local | Cloud API | Balance of privacy and capability |
+| **Hybrid** | Local | Local | Cloud API (optional) | Investigative environment without a focus on confidentiality |
 | **Full Cloud** | Cloud | Cloud API | Team access, scalability |
-## Use Case Examples
+## Artifact Types
-### Example 1: Cryptocurrency Investigation
+### Searchable & Timeline-Enabled
-**Query:** *"Find discussions about buying software with cryptocurrency"*
+| Type | Examples | Features |
 |------|----------|----------|
 | **Messages** | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
 | **Browser** | Chrome, Firefox, Safari, Edge | History, downloads, searches—timeline correlation |
 | **Email** | Any email format (EML, MBOX, PST) | Searchable content, timeline placement |
 | **Documents** | PDF, Word, plain text | Content search, access time correlation |
-**Semeion finds:**
+### Timeline-Only (Context View)
- Chat messages mentioning crypto payments
+| Type | Examples |
- Browser visits to exchange websites
+|------|----------|
- Wallet-related searches
+| **File Events** | File creation, modification, deletion, access |
-
+| **Process Events** | Application launches, process creation |
-**Context view reveals:**
+| **Network Events** | Connections, DNS queries |
-
+| **System Events** | Logs, authentication events, USB connections |
 - File downloads after payment discussions
 - Application installations
 - Network connections to blockchain services
 ### Example 2: Data Exfiltration
 **Query:** *"Messages about sending confidential documents"*
 **Semeion finds:**
 - Emails discussing document sharing
 - Chat messages about file transfers
 **Context view reveals:**
 - File access events before discussions
 - Cloud storage uploads after discussions
 - USB device connections
 ### Example 3: Timeline Reconstruction
 **Query:** *"Threatening messages received in March"*
 **Semeion finds:**
 - Messages matching threatening language patterns
 **Context view reveals:**
 - What the recipient searched afterward
 - Files accessed or deleted
 - Communication with others about the threat
 ## Technical Stack
 | Component | Technology | Purpose |
 |-----------|------------|---------|
-| GUI Framework | PySide6 | Desktop application |
+| **UI Framework** | PySide6 (Qt6) | Native desktop application |
-| Vector Database | Qdrant | Semantic search and storage |
+| **Timeline Rendering** | PyQtGraph or Plotly | High-performance visualization |
-| LLM Interface | OpenAI-compatible API | Query interpretation |
+| **Vector Database** | Qdrant | Semantic search and similarity |
-| Embedding API | OpenAI-compatible API | Vector generation |
+| **Metadata Storage** | SQLite | Fast local queries, metadata |
-| Forensic Parsing | pytsk3, pyewf | Disk image processing |
+| **LLM Interface** | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
-| Language | Python 3.13 (PySide6-restricted) | Application logic |
+| **Embedding** | Sentence Transformers / OpenAI | Multilingual vector generation |
 | **Forensic Parsing** | pytsk3, pyewf | Disk image processing |
 | **Language** | Python 3.13+ | Application logic |
 ## Data Model
-### SemeionArtifact (Simplified)
+### SemeionArtifact (Universal Schema, WiP)
-Every artifact — regardless of source platform — conforms to a universal schema:
+Every artifact—regardless of source—conforms to a unified model:
 ```bash
-┌─────────────────────────────────────────────────────────────────────────┐
+┌─────────────────────────────────────────────────────────┐
 │  SemeionArtifact                                        │
-├─────────────────────────────────────────────────────────────────────────┤
+├─────────────────────────────────────────────────────────┤
 │                                                                         │
 │  Identity:        id, case_id                           │
-│  Classification:  artifact_class, source_platform, searchable           │
+│  Classification:  artifact_class, source_platform       │
-│  Temporal:        timestamp                                             │
+│  Temporal:        timestamp (UTC normalized)            │
 │  Actors:          [{identifier, display_name, role}]    │
 │  Content:         text, semantic_text                   │
-│  Entities:        indexed_entities[] (for filtering)                    │
+│  Entities:        indexed_entities[] (files, IPs, etc)  │
-│  Hierarchy:       parent_id, chunk_info (for documents)                 │
+│  Hierarchy:       parent_id, context_group              │
 │  Context:         context_group (conversation, thread, session)         │
 │  Location:        url, path, title                      │
-│  Source-Specific: message{}, browser{}, email{}, document{}, etc.       │
+│  Embeddings:      semantic_vector (768-dim)             │
-│  Ingestion:       ingested_at, source_file, parser_id                   │
+│  Source-Specific: message{}, browser{}, email{}, etc    │
-│                                                                         │
+└─────────────────────────────────────────────────────────┘
 └─────────────────────────────────────────────────────────────────────────┘
 ```
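As an illustration only, part of the schema above could map onto a Python dataclass. Since the schema is marked WiP, this is a sketch under assumptions: the field types, the defaults, the `Actor` helper class, and the treatment of naive timestamps as already being UTC are not confirmed by the project. Location, embedding, and source-specific fields are omitted for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Actor:
    identifier: str
    display_name: str = ""
    role: str = ""


@dataclass
class SemeionArtifact:
    id: str
    case_id: str
    artifact_class: str
    source_platform: str
    timestamp: datetime
    text: str = ""
    semantic_text: str = ""
    actors: list[Actor] = field(default_factory=list)
    indexed_entities: list[str] = field(default_factory=list)
    parent_id: str | None = None
    context_group: str | None = None

    def __post_init__(self):
        # Normalize to UTC. Naive timestamps are assumed (not known) to be UTC.
        if self.timestamp.tzinfo is None:
            self.timestamp = self.timestamp.replace(tzinfo=timezone.utc)
        else:
            self.timestamp = self.timestamp.astimezone(timezone.utc)


def demo():
    local = datetime(2025, 12, 17, 22, 8, 13, tzinfo=timezone(timedelta(hours=1)))
    return SemeionArtifact(
        id="a1",
        case_id="case-001",
        artifact_class="message",
        source_platform="whatsapp",
        timestamp=local,
        text="example message",
        actors=[Actor(identifier="+491234", role="sender")],
    )
```

Normalizing at construction time means every consumer of the timeline can compare timestamps directly, regardless of the timezone recorded by the source platform.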
-### Vector Strategy
+### Cluster Model (WiP)
-| Vector | Purpose | Required |
+```bash
-|--------|---------|----------|
+┌─────────────────────────────────────────────────────────┐
-| **Semantic** | Conceptual similarity search | Yes |
+│  EventCluster                                           │
-| **Sparse** (keywords) | Exact term matching (hybrid) | Optional |
+├─────────────────────────────────────────────────────────┤
-
+│  id:              unique_cluster_id                     │
-Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views.
+│  time_range:      (start_timestamp, end_timestamp)      │
 │  events:          [artifact_ids]                        │
 │  pattern_type:    "file_exfiltration" | "communication" │
 │  confidence:      0.0-1.0                               │
 │  summary:         "USB transfer with deletion"          │
 │  icon:            "⚠️" | "💬" | "🌐" | etc             │
 │  semantic_links:  [related_cluster_ids]                 │
 └─────────────────────────────────────────────────────────┘
 ```
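The cluster model could be expressed the same way. Again a hedged sketch: the types and the validation rules are assumptions derived from the box above (a confidence between 0.0 and 1.0, a start/end time range), not the project's code, and the `icon` field is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class EventCluster:
    id: str
    time_range: tuple[datetime, datetime]
    events: list[str]  # artifact ids
    pattern_type: str
    confidence: float
    summary: str = ""
    semantic_links: list[str] = field(default_factory=list)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within 0.0-1.0")
        if self.time_range[0] > self.time_range[1]:
            raise ValueError("time_range start must not be after end")


def demo():
    return EventCluster(
        id="c1",
        time_range=(
            datetime(2024, 3, 15, 14, 33, 15),
            datetime(2024, 3, 15, 14, 33, 40),
        ),
        events=["a1", "a2", "a3", "a4"],
        pattern_type="file_exfiltration",
        confidence=0.8,
        summary="USB transfer with deletion",
    )
```

Rejecting invalid values at construction keeps a malformed cluster from reaching the timeline view, where a reversed time range would be harder to diagnose.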
 ## What Semeion Is NOT (yet)
 | Not This | Why |
 |----------|-----|
-| Forensic suite replacement | Companion tool — use alongside Autopsy/Axiom |
+| **Forensic suite replacement** | Companion tool—use alongside Autopsy for acquisition |
-| Reporting Tool | Review and analyse findings, documents in primary application |
+| **Reporting tool** | Timeline export for reports, but documentation happens in primary suite |
-| Forensic Interpretation Robot | Helps you to discover what you otherwise wouldnt |
+| **AI evidence interpreter** | AI assists with search/clustering; investigator interprets evidence |
 ## Development Setup
@@ -196,11 +238,12 @@ This project uses [uv](https://github.com/astral-sh/uv) for dependency managemen
 - Python 3.13+
 - [uv](https://github.com/astral-sh/uv) installed
 - requirements.txt
 ### Installation
 ```bash
-git clone <repository-url>
+git clone https://git.cc24.dev/mstoeck3/semeion
 cd semeion
 # Create virtual environment
@@ -215,13 +258,11 @@ uv pip install -r requirements.txt -e .
 # Configure environment
 cp .env.example .env
-# Edit .env with your Qdrant and LLM endpoint configurations
+# Edit .env with your Qdrant and embedding endpoint configurations
 ```
 ### Running
 uv handles the startup script:
 ```bash
 semeion
 ```
@@ -232,41 +273,83 @@ semeion
 | Resource | Requirement |
 |----------|-------------|
-| CPU | Multi-core |
+| CPU | 4 cores |
-| RAM | 4 GB |
+| RAM | 8 GB |
 | Storage | Minimal (evidence stored elsewhere) |
-| Network | Access to Qdrant and LLM endpoints |
+| GPU | Not required |
 | Network | Access to Qdrant and embedding endpoints (if remote) |
 ### Recommended (Local Processing)
 | Resource | Requirement |
 |----------|-------------|
 | CPU | 8+ cores |
-| RAM | 32 GB |
+| RAM | 16 GB (32 GB for large cases) |
-| Storage | sufficient for evidence & vectors, LLM if installed locally |
+| Storage | SSD, sufficient for evidence & vectors |
-| GPU | optional (improves embedding speed) |
+| GPU | Optional (improves embedding speed with local models) |
 | Network | Optional (fully offline capable) |
 ## Project Status
-**Current Phase:** Architecture and data model definition
+**Current Phase:** MVP Development - Timeline & Core Features
 **Roadmap:**
-1. ✅ Concept and schema design
+1. ✅ Concept and architecture design
-2. ⬜ Core infrastructure (Qdrant collection, basic ingestion)
+2. ⬜ Core infrastructure
-3. ⬜ Search execution (semantic search, filtering)
+   - Unified artifact ingestion (WhatsApp, Chrome)
-4. ⬜ LLM integration (query interpretation)
+   - SQLite + Qdrant integration
-5. ⬜ Refinement system (Qdrant Recommend)
+3. ⬜ Timeline visualization
-6. ⬜ Context view
+   - High-performance rendering (target: <200ms for 100k events)
-7. ⬜ Platform parsers (WhatsApp, Chrome, etc.)
+   - Multi-source swim lanes
-8. ⬜ Hybrid search (sparse vectors)
+   - Zoom/pan navigation
 4. ⬜ Semantic clustering
   - Temporal proximity grouping
   - Pattern template matching
   - Semantic similarity detection
 5. ⬜ Communication search
   - Multilingual embedding
   - Query interpretation
   - Search → timeline jump
 6. ⬜ Additional parsers (Telegram, Signal, etc.)
 7. ⬜ Export and reporting
 8. ⬜ Performance optimization & polish
 ## Why Open Source Matters
 **Transparency for Court:**
 - Auditable algorithms—no "black box" analysis
 - Reproducible results—scientific validation possible
 - Peer-reviewed methods—community scrutiny
 **Accessibility:**
 - Free for budget-constrained labs
 - No vendor lock-in
 - Community-driven development
 **Innovation:**
 - Rapid feature development
 - Specialized extensions possible
 - Academic research enabled
 ## Branding
-**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark" — embodying the software's mission of interpreting semantic signals within digital evidence.
+**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark"—embodying the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.
-The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.
+The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.
 ## License
 BSD 3-Clause
 ## Contributing
 Semeion is in active development. Contributions welcome, especially from:
 - Digital forensics practitioners (workflow validation)
 - Timeline visualization experts
 - Multilingual NLP specialists
 - Performance optimization engineers