rework concept

2025-12-17 22:08:13 +01:00
parent ea460864a6
commit 096b9868a7
1 changed files with 236 additions and 153 deletions
--- a/README.md
+++ b/README.md
@@ -4,189 +4,231 @@

 ## Overview

-Semeion is a **semantic search companion tool** for digital forensics investigators. It does not replace traditional forensic suites (Autopsy, Axiom, X-Ways) but augments them by enabling natural language queries across communication artifacts to quickly identify areas of interest.
+Semeion is a **timeline-first digital forensics analysis platform** that solves the two biggest pain points in modern investigations: navigating massive communication datasets and correlating events across data sources, including filesystem artifacts and OS events.

-> **Core Question Semeion Answers:** "Where should I look?"
+> **Core Question Semeion Answers:** "What happened around this time—and how are these events connected?"

-An investigator opens Semeion alongside their primary forensic tool, types a query like *"discussions about cryptocurrency payments"*, reviews ranked results across chat messages, browser history, emails, and documents, then uses those findings to guide deeper analysis in their forensic suite.
+Traditional forensic tools often have an unintuitive or slow user experience (an opinionated take). Semeion, in contrast, aims to provide a fast, intelligent timeline that automatically clusters semantically related events, enabling investigators to see patterns at a glance rather than clicking through endless individual entries.

-([Intro Video](https://cloud.cc24.dev/s/pNwNWJE9QkDiX3J))
+An investigator opens Semeion alongside their primary forensic tool and either searches for relevant content (like "*discussions about data transfer*" in Russian messages) or navigates to a known timestamp (from incident logs). Semeion then shows all correlated activity (messages, file access, network connections, system events) organized into meaningful sequences rather than flat lists.

 ## Key Features

 | Feature | Description |
 |---------|-------------|
-| **Natural Language Search** | Query artifacts using plain English instead of keywords |
-| **LLM-Assisted Interpretation** | Queries are parsed by an LLM into structured search parameters |
-| **Human-in-the-Loop Confirmation** | Review and edit interpreted search parameters before execution |
-| **Semantic + (Hybrid Search)** | Combines meaning-based vector search with keyword matching |
-| **Interactive Refinement** | Mark results as relevant [+] or irrelevant [-], then refine via Qdrant's Recommend API |
-| **Temporal Context View** | See what happened before and after a discovered artifact |
-| **Universal Artifact Model** | Platform-agnostic — can adapt to any forensic data from external sources, expandable concept |
-| **Flexible Deployment** | Runs fully local or on airgapped forensic networks (or with cloud infrastructure if anyone would do such a thing) |
+| **Timeline Visualization** | Render millions of events with a target response time under 200 ms; smooth zooming/panning |
+| **Semantic Event Clustering** | Automatically group related events into patterns (file transfers, communication bursts, login sequences) |
+| **Multilingual Semantic Search** | Find concepts across languages—search in English, find results in Russian/Chinese/Arabic with translation |
+| **Two-Mode Operation** | Semantic content discovery → timeline analysis OR timestamp → content correlation |
+| **Cross-Source Correlation** | Unified timeline across messages, files, network, browser, system events |
+| **Event Grouping** | Collapse/expand semantically-related events; discover patterns |
+| **Native Desktop Application** | PySide6-based—runs fully local, tailored for forensic-grade airgapped environments |
+| **Open Source & Transparent** | Auditable methods, minimal "black box" AI, court-defensible results |

-## Artifact Types (for first PoC)
+## The Problem Semeion Solves

-### Searchable (Semantic Vector)
+### Problematic UX in Existing Forensic Suites

-These artifacts are embedded and searchable via natural language:
+Existing forensic tools have timeline interfaces that:

-| Type | Examples |
-|------|----------|
-| **Messages** | WhatsApp, Telegram, Signal, or anything which has a parseable SQLite database |
-| **Browser Events** | Chrome, Firefox, Safari, Edge — history, downloads, bookmarks, searches |
-| **Email** | Any email files (simplified: sender, receiver, subject, body, timestamp) |
-| **Documents** | PDF, Word, plain text |
+- Render slowly
+- Show flat event lists with no context
+- Require manual correlation across sources
+- Have poor visualization and navigation
+*This is a strongly opinionated view, based on the maintainer's personal experience with several large commercial suites that weren't able to deliver satisfying results. I won't name any here, and exceptions may exist, but there is clear room for improvement.*

-### Timeline-Only (Context View)
+**Semeion's Solution:** Fast, interactive timeline with semantic clustering that shows patterns immediately.

-These artifacts appear in the temporal context view but are not semantically searchable:
+### Large Communication Datasets

-| Type | Examples |
-|------|----------|
-| **File Events** | File creation, modification, deletion, access (via Sleuthkit) |
-| **Process Events** | Application launches, process creation |
-| **Network Events** | Connections, DNS queries |
-| **Registry Events** | Windows registry modifications |
-| **System Events** | Logs, authentication events |
+Modern investigations involve:
+
+- Seized databases with millions of relevant artifacts
+- Multilingual content
+- Slang, code words, and poor machine translation
+- Hours spent reading irrelevant conversations
+
+**Semeion's Solution:** Semantic search that understands meaning, handles multiple languages, and jumps directly to timeline context.

 ## How It Works

-![resources/workflow.png](resources/workflow.png)
+### Two Entry Points
+
+#### **Entry Point 1: Semantic content discovery → timeline analysis**
+
+```bash
+1. Investigator searches: "discussions about file transfers"
+2. Semeion finds semantically relevant messages
+3. Click any result → Timeline centers on that timestamp
+4. See all activity ±2 hours: file access, USB connections, deletions
+5. Semantic clustering highlights: "File Exfiltration Sequence" pattern
+```
+
+#### **Entry Point 2: Timestamp → content correlation**
+
+```bash
+1. Incident log shows: "Ransomware encrypted files at 2024-03-15 14:22:00"
+2. Navigate timeline to that timestamp
+3. Semantic clustering reveals:
+   - Communication spike 30min before (cluster: "Coordination Activity")
+   - Suspicious file downloads (cluster: "Malware Delivery")
+   - Network connections (cluster: "C2 Communication")
+   - Process executions leading to encryption
+4. Complete attack chain visible at a glance
+```
+
+## Semantic Event Clustering
+
+**Traditional Timeline View (Overwhelming):**
+
+```bash
+14:33:15 - USB device connected
+14:33:18 - Chrome visited facebook.com
+14:33:22 - File modified: report.docx
+14:33:30 - File copied to USB: project_data.zip (2.3 GB)
+14:33:35 - File copied to USB: budget.xlsx
+14:33:40 - File deleted: project_data.zip
+14:33:45 - WhatsApp message sent
+...50 more individual events...
+```
+
+**Semeion Clustered View (Clear Pattern):**
+
+```bash
+┌──────────────────────────────────────────────────┐
+│ 14:33:15-14:33:40  ⚠️ File Transfer (4)          │ ← Click to expand
+│ Pattern: Data Exfiltration Sequence              │
+│ USB connected → 2 files copied → source deleted  │
+│ Related: Files mentioned in chat 2min earlier    │
+└──────────────────────────────────────────────────┘
+```
+
+### How Clustering Works
+
+1. **Temporal Proximity**: Events within a configurable time window are evaluated together
+2. **Semantic Similarity**: Vector embeddings detect conceptually related events
+3. **Entity Linking**: Shared file names, IPs, usernames connect events
+4. **Pattern Templates**: Pre-defined sequences (USB exfiltration, login chains, etc.)
+5. **LLM Summarization** (Optional): Natural language cluster descriptions

 ## Architecture

-### Client-Server Design
+### Desktop Application (PySide6)

 ```bash
-┌─────────────────────────────────────────────────────────────────────────┐
-│                                                                         │
-│                      ┌─────────────────────┐                            │
-│                      │   PySide6 Client    │                            │
-│                      │   (Investigator     │                            │
-│                      │    Workstation)     │                            │
-│                      └──────────┬──────────┘                            │
-│                                 │                                       │
-│                 ┌───────────────┴───────────────┐                       │
-│                 │                               │                       │
-│                 ▼                               ▼                       │
-│      ┌─────────────────────┐       ┌──────────────────────┐             │
-│      │    Qdrant API       │       │   LLM API            │             │
-│      │                     │       │   (OpenAI-compatible)│             │
-│      └─────────────────────┘       └──────────────────────┘             │
-│                                                                         │
-└─────────────────────────────────────────────────────────────────────────┘
+┌─────────────────────────────────────────────────────────┐
+│                   PySide6 Desktop UI                    │
+│  ┌───────────────┐  ┌──────────────┐  ┌─────────────┐  │
+│  │   Timeline    │  │   Semantic   │  │   Artifact  │  │
+│  │     View      │  │    Search    │  │    Detail   │  │
+│  └───────────────┘  └──────────────┘  └─────────────┘  │
+└─────────────────────────────────────────────────────────┘
+                          ↓
+┌─────────────────────────────────────────────────────────┐
+│              Analysis & Clustering Engine               │
+│  • Event clustering algorithm                           │
+│  • Semantic similarity calculation                      │
+│  • Cross-artifact correlation                           │
+│  • Timeline rendering engine                            │
+└─────────────────────────────────────────────────────────┘
+                          ↓
+┌─────────────────────────────────────────────────────────┐
+│                   Data Layer                            │
+│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
+│  │   Qdrant     │  │   SQLite     │  │   LLM API    │  │
+│  │   (Vectors)  │  │   (Metadata) │  │   (Optional) │  │
+│  └──────────────┘  └──────────────┘  └──────────────┘  │
+└─────────────────────────────────────────────────────────┘
 ```

 ### Deployment Options

-| Configuration | Qdrant | LLM | Use Case |
-|---------------|--------|-----|----------|
-| **Fully Local** | localhost | Ollama (localhost) | Single investigator, offline |
-| **Airgapped Network** | Internal server | Internal server | Forensic lab, sensitive cases |
-| **Hybrid** | Local | Cloud API | Balance of privacy and capability |
-| **Full Cloud** | Cloud | Cloud API | Team access, scalability |
+| Configuration | Qdrant | Embedding | LLM | Use Case |
+|---------------|--------|-----------|-----|----------|
+| **Fully Local** | localhost | Ollama/local model | Ollama (optional) | Single investigator, offline |
+| **Airgapped Network** | Internal server | Internal server | Internal server | Proper forensic environment |
+| **Hybrid** | Local | Local | Cloud API (optional) | Investigative environment where confidentiality is not the focus |

-## Use Case Examples
+## Artifact Types

-### Example 1: Cryptocurrency Investigation
+### Searchable & Timeline-Enabled

-**Query:** *"Find discussions about buying software with cryptocurrency"*
+| Type | Examples | Features |
+|------|----------|----------|
+| **Messages** | WhatsApp, Telegram, Signal, SMS | Semantic search, multilingual, timeline clustering |
+| **Browser** | Chrome, Firefox, Safari, Edge | History, downloads, searches—timeline correlation |
+| **Email** | Any email format (EML, MBOX, PST) | Searchable content, timeline placement |
+| **Documents** | PDF, Word, plain text | Content search, access time correlation |

-**Semeion finds:**
+### Timeline-Only (Context View)

- Chat messages mentioning crypto payments
- Browser visits to exchange websites
- Wallet-related searches
-
-**Context view reveals:**
-
- File downloads after payment discussions
- Application installations
- Network connections to blockchain services
-
-### Example 2: Data Exfiltration
-
-**Query:** *"Messages about sending confidential documents"*
-
-**Semeion finds:**
-
- Emails discussing document sharing
- Chat messages about file transfers
-
-**Context view reveals:**
-
- File access events before discussions
- Cloud storage uploads after discussions
- USB device connections
-
-### Example 3: Timeline Reconstruction
-
-**Query:** *"Threatening messages received in March"*
-
-**Semeion finds:**
-
- Messages matching threatening language patterns
-
-**Context view reveals:**
-
- What the recipient searched afterward
- Files accessed or deleted
- Communication with others about the threat
+| Type | Examples |
+|------|----------|
+| **File Events** | File creation, modification, deletion, access |
+| **Process Events** | Application launches, process creation |
+| **Network Events** | Connections, DNS queries |
+| **System Events** | Logs, authentication events, USB connections |

 ## Technical Stack

 | Component | Technology | Purpose |
 |-----------|------------|---------|
-| GUI Framework | PySide6 | Desktop application |
-| Vector Database | Qdrant | Semantic search and storage |
-| LLM Interface | OpenAI-compatible API | Query interpretation |
-| Embedding API | OpenAI-compatible API | Vector generation |
-| Forensic Parsing | pytsk3, pyewf | Disk image processing |
-| Language | Python 3.13 (PySide6-restricted) | Application logic |
+| **UI Framework** | PySide6 (Qt6) | Native desktop application |
+| **Timeline Rendering** | PyQtGraph or Plotly | High-performance visualization |
+| **Vector Database** | Qdrant | Semantic search and similarity |
+| **Metadata Storage** | SQLite | Fast local queries, metadata |
+| **LLM Interface** | OpenAI-compatible API | Optional: cluster summarization, query interpretation |
+| **Embedding** | Sentence Transformers / OpenAI | Multilingual vector generation |
+| **Forensic Parsing** | pytsk3, pyewf | Disk image processing |
+| **Language** | Python 3.13+ | Application logic |

 ## Data Model

-### SemeionArtifact (Simplified)
+### SemeionArtifact (Universal Schema, WiP)

-Every artifact — regardless of source platform — conforms to a universal schema:
+Every artifact—regardless of source—conforms to a unified model:

 ```bash
-┌─────────────────────────────────────────────────────────────────────────┐
-│  SemeionArtifact                                                        │
-├─────────────────────────────────────────────────────────────────────────┤
-│                                                                         │
-│  Identity:        id, case_id                                           │
-│  Classification:  artifact_class, source_platform, searchable           │
-│  Temporal:        timestamp                                             │
-│  Actors:          [{identifier, display_name, role}]                    │
-│  Content:         text, semantic_text                                   │
-│  Entities:        indexed_entities[] (for filtering)                    │
-│  Hierarchy:       parent_id, chunk_info (for documents)                 │
-│  Context:         context_group (conversation, thread, session)         │
-│  Location:        url, path, title                                      │
-│  Source-Specific: message{}, browser{}, email{}, document{}, etc.       │
-│  Ingestion:       ingested_at, source_file, parser_id                   │
-│                                                                         │
-└─────────────────────────────────────────────────────────────────────────┘
+┌─────────────────────────────────────────────────────────┐
+│  SemeionArtifact                                        │
+├─────────────────────────────────────────────────────────┤
+│  Identity:        id, case_id                           │
+│  Classification:  artifact_class, source_platform       │
+│  Temporal:        timestamp (UTC normalized)            │
+│  Actors:          [{identifier, display_name, role}]    │
+│  Content:         text, semantic_text                   │
+│  Entities:        indexed_entities[] (files, IPs, etc)  │
+│  Hierarchy:       parent_id, context_group              │
+│  Location:        url, path, title                      │
+│  Embeddings:      semantic_vector (768-dim)             │
+│  Source-Specific: message{}, browser{}, email{}, etc    │
+└─────────────────────────────────────────────────────────┘
 ```

-### Vector Strategy
+### Cluster Model (WiP)

-| Vector | Purpose | Required |
-|--------|---------|----------|
-| **Semantic** | Conceptual similarity search | Yes |
-| **Sparse** (keywords) | Exact term matching (hybrid) | Optional |
-
-Timeline-only artifacts use a placeholder vector and are excluded from search but included in context views.
+```bash
+┌─────────────────────────────────────────────────────────┐
+│  EventCluster                                           │
+├─────────────────────────────────────────────────────────┤
+│  id:              unique_cluster_id                     │
+│  time_range:      (start_timestamp, end_timestamp)      │
+│  events:          [artifact_ids]                        │
+│  pattern_type:    "file_exfiltration" | "communication" │
+│  confidence:      0.0-1.0                               │
+│  summary:         "USB transfer with deletion"          │
+│  icon:            "⚠️" | "💬" | "🌐" | etc             │
+│  semantic_links:  [related_cluster_ids]                 │
+└─────────────────────────────────────────────────────────┘
+```

 ## What Semeion Is NOT (yet)

 | Not This | Why |
 |----------|-----|
-| Forensic suite replacement | Companion tool — use alongside Autopsy/Axiom |
-| Reporting Tool | Review and analyse findings, documents in primary application |
-| Forensic Interpretation Robot | Helps you to discover what you otherwise wouldnt |
+| **Forensic suite replacement** | Companion tool—use alongside Autopsy for acquisition |
+| **Reporting tool** | Timelines can be exported for reports, but documentation happens in the primary suite |
+| **AI evidence interpreter** | AI assists with search/clustering; investigator interprets evidence |

 ## Development Setup

@@ -196,11 +238,12 @@ This project uses [uv](https://github.com/astral-sh/uv) for dependency managemen

 - Python 3.13+
 - [uv](https://github.com/astral-sh/uv) installed
+- Dependencies listed in `requirements.txt`

 ### Installation

 ```bash
-git clone <repository-url>
+git clone https://git.cc24.dev/mstoeck3/semeion
 cd semeion

 # Create virtual environment
@@ -215,13 +258,11 @@ uv pip install -r requirements.txt -e .

 # Configure environment
 cp .env.example .env
-# Edit .env with your Qdrant and LLM endpoint configurations
+# Edit .env with your Qdrant and embedding endpoint configurations
 ```

 ### Running

-uv handles the startup script:
-
 ```bash
 semeion
 ```
@@ -232,41 +273,83 @@ semeion

 | Resource | Requirement |
 |----------|-------------|
-| CPU | Multi-core |
-| RAM | 4 GB |
+| CPU | 4 cores |
+| RAM | 8 GB |
 | Storage | Minimal (evidence stored elsewhere) |
-| Network | Access to Qdrant and LLM endpoints |
+| GPU | Not required |
+| Network | Access to Qdrant and embedding endpoints (if remote) |

 ### Recommended (Local Processing)

 | Resource | Requirement |
 |----------|-------------|
 | CPU | 8+ cores |
-| RAM | 32 GB |
-| Storage | sufficient for evidence & vectors, LLM if installed locally |
-| GPU | optional (improves embedding speed) |
+| RAM | 16 GB (32 GB for large cases) |
+| Storage | SSD, sufficient for evidence & vectors |
+| GPU | Optional (improves embedding speed with local models) |
+| Network | Optional (fully offline capable) |

 ## Project Status

-**Current Phase:** Architecture and data model definition
+**Current Phase:** MVP Development - Timeline & Core Features

 **Roadmap:**

-1. ✅ Concept and schema design
-2. ⬜ Core infrastructure (Qdrant collection, basic ingestion)
-3. ⬜ Search execution (semantic search, filtering)
-4. ⬜ LLM integration (query interpretation)
-5. ⬜ Refinement system (Qdrant Recommend)
-6. ⬜ Context view
-7. ⬜ Platform parsers (WhatsApp, Chrome, etc.)
-8. ⬜ Hybrid search (sparse vectors)
+1. ✅ Concept and architecture design
+2. ⬜ Core infrastructure
+   - Unified artifact ingestion (WhatsApp, Chrome)
+   - SQLite + Qdrant integration
+3. ⬜ Timeline visualization
+   - High-performance rendering (target: <200ms for 100k events)
+   - Multi-source swim lanes
+   - Zoom/pan navigation
+4. ⬜ Semantic clustering
+   - Temporal proximity grouping
+   - Pattern template matching
+   - Semantic similarity detection
+5. ⬜ Communication search
+   - Multilingual embedding
+   - Query interpretation
+   - Search → timeline jump
+6. ⬜ Additional parsers (Telegram, Signal, etc.)
+7. ⬜ Export and reporting
+8. ⬜ Performance optimization & polish
+
+## Why Open Source Matters
+
+**Transparency for Court:**
+
+- Auditable algorithms—no "black box" analysis
+- Reproducible results—scientific validation possible
+- Peer-reviewed methods—community scrutiny
+
+**Accessibility:**
+
+- Free for budget-constrained labs
+- No vendor lock-in
+- Community-driven development
+
+**Innovation:**
+
+- Rapid feature development
+- Specialized extensions possible
+- Academic research enabled

 ## Branding

-**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark" — embodying the software's mission of interpreting semantic signals within digital evidence.
+**Semeion** derives from the Greek σημεῖον (sēmeîon), meaning "sign," "signal," or "meaningful mark"—embodying the software's mission of revealing meaningful patterns within digital evidence through intelligent timeline analysis.

-The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden knowledge through analytical thought.
+The mascot, **Koios** (Coeus), the Titan of intellect and inquiry in Greek mythology, represents the investigative mindset: the pursuit of hidden connections through analytical thought.

 ## License

 BSD 3-Clause
+
+## Contributing
+
+Semeion is in active development. Contributions are welcome, especially from:
+
+- Digital forensics practitioners (workflow validation)
+- Timeline visualization experts
+- Multilingual NLP specialists
+- Performance optimization engineers